Our client, a hypothetical pharmaceutical company, wants to better understand data relating individuals (clients and potential clients) to various health conditions and miscellaneous attributes. The goal is to extract meaningful information that could guide future research and support the company's rapidly expanding business and market share while improving the wellbeing of its clients.
Individual - A person surveyed for the NHANES (National Health and Nutrition Examination Survey) dataset on attributes covering demographics, examinations, diet, questionnaire responses (medical history), and medications.
Health Conditions - Various diseases or ailments that people may exhibit, such as sleep disorders, diabetes, oral health problems, and high cholesterol.
The National Health and Nutrition Examination Survey (NHANES) - A program of studies designed to assess the health and nutritional status of adults and children in the United States.
The data gathered is spread across six distinct files (CSV format): Demographics, Examinations, Dietary, Laboratory, Questionnaire, and Medication.
Our client wants to develop new drugs whose primary intent is to improve the quality of life of the individuals surveyed. The company is interested in whether existing data on subjects and their associated health conditions could provide advice and insight to its researchers. They have obtained the NHANES dataset and requested our assistance in performing the intended analysis. This dataset contains individual-level data along with various information, including health conditions.
The company is interested in developing new drugs for the following health conditions: diabetes, hypertension (high blood pressure), and cancer.
The company, aware of our Machine Learning skills, approached us for help on the following problems:
The company has a mature R&D function relative to other pharmaceutical companies. However, the researchers are looking for any kind of edge or extra piece of information that might help them develop drugs for diabetes, hypertension, and cancer. In addition, clinical trials for new drugs are increasingly expensive. Is there anything in the data that could help lower these costs?
The marketing department is struggling with the high cost of television advertisements and is interested in ways to reduce spending while still reaching its target markets, both for advertising drugs and for attracting candidates for trials.
For candidates already enrolled in the diabetes program, is there a way to tell whether they have other diseases, such as cancer or hypertension, with the minimum number of questions? The cost of performing lab tests could be reduced, and if a candidate does have another disease, they could be referred into another drug testing plan for that disease.
The company also asked about possibly wrapping the model in a robust, easy-to-use app that could be presented to management and corporate to assist with decision making based on a few user inputs.
The first business problem involves finding commonality between individuals. This is a clustering problem and we will need a model using a clustering algorithm. We need to determine whether the business’s presumption is accurate:
If this is the case, we need to find clusters of subjects that segregate the data by health conditions and report these findings to the business.
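To make the clustering idea concrete, here is a minimal sketch on simulated data (the variables, values, and cluster count are illustrative stand-ins, not the NHANES features we will ultimately use):

```r
# Toy example: two simulated subject groups separated on BMI and glucose.
set.seed(42)
toy <- data.frame(
  bmi     = c(rnorm(50, 22, 2), rnorm(50, 32, 2)),   # simulated values
  glucose = c(rnorm(50, 90, 8), rnorm(50, 140, 10))  # simulated values
)
# Standardize, then cluster with k-means (base R, stats package)
km <- kmeans(scale(toy), centers = 2, nstart = 25)
table(km$cluster)  # sizes of the two recovered clusters
```

If clusters like these aligned cleanly with the condition flags, that would support the business's presumption.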
The second business problem involves using the “health condition” features and finding related features. This is an association problem, so we will need a model using an association algorithm. For this, we will likely only need the diet and demographic data: this is a marketing problem, so other attributes may not be applicable.
The third business problem is a supervised machine learning problem. Using the least amount of information, if an individual has diabetes, can we predict whether they have hypertension, cancer, or both?
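As a minimal sketch of this supervised task (entirely simulated data; `age` and `bmi` are hypothetical stand-ins for whatever small question set we eventually choose), base R's `glm` shows the shape of the model:

```r
# Toy example: predict a hypertension flag from two cheap-to-collect inputs.
set.seed(7)
n   <- 200
age <- round(runif(n, 30, 80))
bmi <- rnorm(n, 29, 5)
# Simulated label whose risk rises with age and BMI
hyp  <- rbinom(n, 1, plogis(-8 + 0.08 * age + 0.12 * bmi))
fit  <- glm(hyp ~ age + bmi, family = binomial)
risk <- predict(fit, type = "response")  # per-subject probability in [0, 1]
```

The real model would be fit on the questionnaire-derived condition flags, restricted to subjects who report diabetes.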
For the association problem, we will need to see which attributes are tied to the “health condition” features. In order to achieve this, we are postulating that the following columns/features of the Questionnaire dataset indicate that an individual has a “health condition”:
DIQ010 - Doctor told you have diabetes https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/DIQ_H.htm The next questions are about specific medical conditions. {Other than during pregnancy, {have you/has SP}/{Have you/Has SP}} ever been told by a doctor or health professional that {you have/{he/she/SP} has} diabetes or sugar diabetes?
BPQ020 - Ever told you had high blood pressure https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/BPQ_H.htm {Have you/Has SP} ever been told by a doctor or other health professional that {you/s/he} had hypertension, also called high blood pressure?
MCQ220 - Ever told you had cancer or malignancy https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/MCQ_H.htm#MCQ220 {Have you/Has SP} ever been told by a doctor or other health professional that {you/s/he} had cancer or a malignancy (ma-lig-nan-see) of any kind?
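These three questionnaire items can be turned into binary condition flags. A small sketch (toy rows; NHANES-style response coding assumed: 1 = Yes, 2 = No, 7 = Refused, 9 = Don't know):

```r
# Treat only an explicit "Yes" (code 1) as having the condition;
# keep true missings as NA rather than guessing.
flag_yes <- function(x) ifelse(is.na(x), NA, as.integer(x == 1))
q <- data.frame(DIQ010 = c(1, 2, 9, NA),   # toy responses
                BPQ020 = c(2, 1, 1, 2),
                MCQ220 = c(2, 2, 1, 2))
q$has_diabetes     <- flag_yes(q$DIQ010)
q$has_hypertension <- flag_yes(q$BPQ020)
q$has_cancer       <- flag_yes(q$MCQ220)
q$has_diabetes  # 1 0 0 NA
```

Whether codes 7 and 9 should map to "No" or to missing is a judgment call worth revisiting before modeling.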
library(plyr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(knitr)
library(mice)
library(scales)
library(randomForest)
library(psych)
library(factoextra)
library(RColorBrewer)
library(caret)
library(plotly)
library(AMR)
As indicated earlier, the dataset consists of six raw data files: Demographics, Examinations, Dietary, Laboratory, Questionnaire, and Medication. The largest dataset, in terms of attributes, contains 953 variables, while the smallest one contains 47 variables.
Because this is a large amount of data, with over a thousand attributes cumulatively, we decided to employ the following guidelines to reduce the complexity of the data:
Ideally, we would like to analyze and impute every attribute with missing values, but in this situation, it may not be practical due to the large volume of missing data.
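The missingness guideline can be sketched in a few lines of base R (toy data frame; the real files are far wider):

```r
# Sketch of the guideline: drop any column whose missing fraction
# exceeds 0.25.
toy <- data.frame(a = c(1, NA, NA, NA),   # 75% missing -> dropped
                  b = c(1, 2, NA, 4),     # 25% missing -> kept
                  c = 1:4)                #  0% missing -> kept
miss_frac <- colMeans(is.na(toy))
kept <- toy[, miss_frac <= 0.25, drop = FALSE]
names(kept)  # "b" "c"
```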
# Reading files
demographic = read.csv("Data/Raw/demographic.csv", header = TRUE, na.strings = c("NA","","#NA"))
diet = read.csv("Data/Raw/diet.csv", header = TRUE, na.strings = c("NA","","#NA"))
examination = read.csv("Data/Raw/examination.csv", header = TRUE, na.strings = c("NA","","#NA"))
labs = read.csv("Data/Raw/labs.csv", header = TRUE, na.strings = c("NA","","#NA"))
medications = read.csv("Data/Raw/medications.csv", header = TRUE, na.strings = c("NA","","#NA"))
questionnaire = read.csv("Data/Raw/questionnaire.csv", header = TRUE, na.strings = c("NA","","#NA"))
# Merging files (plyr::join_all joins on the columns the files share, here SEQN)
data_list = list(demographic, examination, diet, labs, questionnaire, medications)
data_joined = join_all(data_list)
It is always essential to check for missing values and consider how to address them in the model.
We chose to present the Demographic and Diet datasets first, as they are mostly complete.
We found that the percentage of missing data in four of the six files is very significant; almost all attributes/columns have some degree of missing values.
demographic_MS <- demographic %>% summarise_all(~(sum(is.na(.))/n()))
demographic_MS <- gather(demographic_MS, key = "variables", value = "percent_missing")
demographic_MS <- demographic_MS[demographic_MS$percent_missing > 0.0, ]
demographic_MS_plot <- ggplot(demographic_MS, aes(x = reorder(variables,percent_missing),
y = percent_missing)) +
geom_bar(stat = "identity", fill = "blue", aes(color = I('white')),
size = 0.3, alpha = 0.8)+
xlab('variables')+ coord_flip()+
#theme_fivethirtyeight() +
ggtitle("Demographic Missing Data By Columns")
#demographic_MS_plot
ggsave(plot = demographic_MS_plot, width = 8, height = 4, dpi = 300,
filename = "Figures/demographic_MS_plot.png")
diet_MS <- diet %>% summarise_all(~(sum(is.na(.))/n()))
diet_MS <- gather(diet_MS, key = "variables", value = "percent_missing")
diet_MS <- diet_MS[diet_MS$percent_missing > 0.0, ]
diet_MS_plot <- ggplot(diet_MS, aes(x = reorder(variables, percent_missing),
y = percent_missing)
) +
geom_bar(stat = "identity", fill = "blue", aes(color = I('blue')),
size = 0.3, alpha = 0.8)+
xlab('variables') + coord_flip()+
#theme_fivethirtyeight() +
ggtitle("Diet Missing Data By Columns")+
theme(axis.text.y=element_text(size=3))
#diet_MS_plot
ggsave(plot = diet_MS_plot, width = 8, height = 4, dpi = 300,
filename = "Figures/diet_MS_plot.png")
examination_MS <- examination %>% summarise_all(~(sum(is.na(.))/n()))
examination_MS <- gather(examination_MS, key = "variables", value = "percent_missing")
examination_MS <- examination_MS[examination_MS$percent_missing > 0.0, ]
examination_MS_plot <- ggplot(examination_MS, aes(x = reorder(variables, percent_missing),
y = percent_missing)) +
geom_bar(stat = "identity", fill = "blue", aes(color = I('blue')),
size = 0.3, alpha = 0.8)+
xlab('variables')+ coord_flip()+
#theme_fivethirtyeight() +
ggtitle("Examination Missing Data By Columns")+
theme(axis.text.y=element_text(size=3))
#examination_MS_plot
ggsave(plot = examination_MS_plot, width = 8, height = 4, dpi = 300,
filename = "Figures/examination_MS_plot.png")
labs_MS <- labs %>% summarise_all(~(sum(is.na(.))/n()))
labs_MS <- gather(labs_MS, key = "variables", value = "percent_missing")
labs_MS <- labs_MS[labs_MS$percent_missing > 0.0, ]
labs_MS_plot <- ggplot(labs_MS, aes(x = reorder(variables, percent_missing),
y = percent_missing)
) +
geom_bar(stat = "identity", fill = "blue", aes(color = I('blue')),
size = 0.3, alpha = 0.8)+
xlab('variables') + coord_flip()+
ggtitle("Labs Missing Data By Columns")+
theme(axis.text.y=element_text(size=3))
#labs_MS_plot
ggsave(plot = labs_MS_plot, width = 8, height = 4, dpi = 300,
filename = "Figures/labs_MS_plot.png")
medications_MS <- medications %>% summarise_all(~(sum(is.na(.))/n()))
medications_MS <- gather(medications_MS, key = "variables", value = "percent_missing")
medications_MS <- medications_MS[medications_MS$percent_missing > 0.0, ]
medications_MS_plot <- ggplot(medications_MS, aes(x = reorder(variables, percent_missing),
y = percent_missing)) +
geom_bar(stat = "identity", fill = "blue", aes(color = I('white')),
size = 0.3, alpha = 0.8)+
xlab('variables')+ coord_flip()+
#theme_fivethirtyeight() +
ggtitle("Medications Missing Data By Columns")
#medications_MS_plot
ggsave(plot = medications_MS_plot, width = 8, height = 4, dpi = 300,
filename = "Figures/medications_MS_plot.png")
questionnaire_MS <- questionnaire %>% summarise_all(~(sum(is.na(.))/n()))
questionnaire_MS <- gather(questionnaire_MS, key = "variables", value = "percent_missing")
questionnaire_MS <- questionnaire_MS[questionnaire_MS$percent_missing > 0.0, ]
questionnaire_MS_plot <- ggplot(questionnaire_MS, aes(x = reorder(variables, percent_missing),
y = percent_missing)) +
geom_bar(stat = "identity", fill = "blue", aes(color = I('blue')),
size = 0.3, alpha = 0.8)+
xlab('variables')+ coord_flip()+
ggtitle("Questionnaire Missing Data By Columns")+
theme(axis.text.y=element_text(size=3))
#questionnaire_MS_plot
ggsave(plot = questionnaire_MS_plot, width = 8, height = 4, dpi = 300,
filename = "Figures/questionnaire_MS_plot.png")
As per our guidelines, we will select attributes/columns of interest based on our business/personal judgements. The full NHANES data dictionary/variable list is available at the following URL:
https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?BeginYear=2013
We first remove the variables having near-zero variance in the dataset. Later, we will remove the variables with more than 25% missing values from the Demographics data.
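For context, `caret::nearZeroVar` flags a column when the ratio of its most frequent to second most frequent value exceeds `freqCut` and the percentage of distinct values falls below `uniqueCut`. The two criteria can be computed by hand (toy column):

```r
# A nearly constant column: 98 zeros, one 1, one 2.
x <- c(rep(0, 98), 1, 2)
tab <- sort(table(x), decreasing = TRUE)
freq_ratio <- unname(tab[1] / tab[2])               # 98 / 1 = 98
unique_pct <- 100 * length(unique(x)) / length(x)   # 3 distinct of 100 = 3%
near_zero  <- freq_ratio > 90/2 & unique_pct < 10   # TRUE: would be dropped
```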
demographic_major <- demographic
if (length(nearZeroVar(demographic_major, freqCut = 90/2, uniqueCut = 10, saveMetrics = FALSE,
names = FALSE, foreach = FALSE, allowParallel = TRUE)) > 0){
demographic_major <- demographic_major[, -nearZeroVar(demographic_major, freqCut = 90/2, uniqueCut = 10, saveMetrics = FALSE,
names = FALSE, foreach = FALSE, allowParallel = TRUE)]
}
# Check for columns with more than 25% missing values
sapply(demographic_major, function(x) mean(is.na(x)) * 100) %>%
  stack %>% rev %>% filter(values > 25) %>% setNames(nm = c("variable", "pct_missing"))
Null_Num <- apply(demographic_major, 2, function(x) length(which(is.na(x) | x == "NA"))/length(x))
Null_Colms <- colnames(demographic_major)[Null_Num > 0.25]
demographic75 <- select(demographic_major, -Null_Colms)
We will now use our Dictionary to build a reference dataframe so we can distinguish the different variable types quickly and consistently:
demographic_indexed <- demographic75
colnames(demographic_indexed) <- with(Dictionary,
Dictionary$Variable.Description[match(colnames(demographic75),
Dictionary$Variable.Name,
nomatch = Dictionary$Variable.Name
)])
Demogramphic_Col_Labels <- data.frame("Code"=c(colnames(demographic75)),
"Desp"=c(colnames(demographic_indexed)))
Categorization of variables
We now enter the Factor/Numeric/“Computation not required” categorization in the generated Excel file:
* Enter codes only in the 3rd column, which should be named “Cat”.
* 0 = Factor requiring no computation.
* 1 = Numeric requiring computation.
* 2 = Factor requiring computation.
Reading the index back in:
Cat_demo <- c(0,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1)
Demogramphic_Col_Labels <- data.frame(Demogramphic_Col_Labels,Cat = Cat_demo)
Now we prepare the dataset for imputation using this information.
Catcolmn <- Demogramphic_Col_Labels[Demogramphic_Col_Labels$Cat ==2 , 2 ]
Numcolmn <- Demogramphic_Col_Labels[Demogramphic_Col_Labels$Cat ==1 , 2 ]
Catcolmn_Nul <- Demogramphic_Col_Labels[Demogramphic_Col_Labels$Cat ==0 , 2 ]
WorkingColm <- c(Catcolmn_Nul, Numcolmn, Catcolmn)
WorkingColm
demographic_selected = demographic75[ WorkingColm ]
demographic_selected[, Catcolmn] <- sapply(demographic_selected[, Catcolmn], as.numeric)
demographic_selected[, Catcolmn_Nul] <- sapply(demographic_selected[, Catcolmn_Nul], as.numeric)
demographic_selected[, Numcolmn] <- sapply(demographic_selected[, Numcolmn], as.numeric)
init = mice(demographic_selected, maxit = 0)  # dry run to obtain default settings
meth = init$method
predM = init$predictorMatrix
predM[, c("SEQN")]=0
meth[Catcolmn_Nul] = ""
meth[Catcolmn]="cart"
meth[Numcolmn]="rf"
set.seed(103)
imputed = mice(demographic_selected, method=meth, predictorMatrix=predM, m=5)
#Create a dataset after imputation.
demographic_imputed<- complete(imputed)
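Before discarding the `mice` object, it is worth a quick sanity check that the imputations look plausible. The following are standard `mice` diagnostics (a sketch, assuming the `imputed` object from the step above is still in memory):

```r
# Standard mice diagnostics on the fitted imputation object
summary(imputed)               # imputation method used per variable
densityplot(imputed)           # imputed vs. observed value distributions
stripplot(imputed, pch = 20)   # spread of values across the 5 imputations
```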
rm(Demogramphic_Col_Labels,demographic75,demographic_selected, imputed)
dir.create("Data/Clean_Imputes")
#write.csv(demographic_imputed , "Data/Clean_Imputes/demographic_imputed.csv",row.names = FALSE)
demographic_imputed = read.csv("Data/Clean_Imputes/demographic_imputed.csv", header = TRUE, na.strings = c("NA","","#NA"))
We have selected the following 8 relevant columns among the 32 that have less than 25% of missing values:
Now we will label the dataset for visualizations.
multi.hist(demo_subset_8_imputed[,sapply(demo_subset_8_imputed, is.numeric)])
# Keep the ID plus the 8 selected demographic columns
demo_subset_8 <- demographic_imputed %>%
  select(SEQN, RIAGENDR, RIDAGEYR, RIDRETH3, DMDBORN4, DMDCITZN, DMDFMSIZ, DMDHRMAR, INDFMIN2)
demo_subset_8 <- demo_subset_8 %>%
rename("ID" = "SEQN",
"Gender" = "RIAGENDR",
"Age" = "RIDAGEYR",
"Race" = "RIDRETH3",
"Country_of_birth" = "DMDBORN4",
"Citizenship_status" = "DMDCITZN",
"Family_members" = "DMDFMSIZ",
"Marital_status" = "DMDHRMAR",
"Family_income" = "INDFMIN2" )
sapply(demo_subset_8, function(x) sum(is.na(x)))
require(dplyr)  # Race is a factor with 6 levels
demo_subset_8_labeled <- demo_subset_8
demo_subset_8_labeled <- demo_subset_8_labeled %>%
mutate(Race = recode(Race, "1" = "Mexican_American",
"2" = "Other_Hispanic",
"3" = "White",
"4" = "Black",
"6" = "Asian",
"7" = "multiracial"))
demo_subset_8_labeled <- demo_subset_8_labeled %>%
mutate(Country_of_birth = recode(Country_of_birth , "1" = "US",
"2" = "Others",
"77" = "Refused",
"99" = "Uknown"))
demo_subset_8_labeled <- demo_subset_8_labeled %>%
mutate(Citizenship_status = recode(Citizenship_status, "1" = "US",
"2" = "Other",
"7" = "Refused",
"9" = "Unknown"))
demo_subset_8_labeled <- demo_subset_8_labeled %>%
mutate(Marital_status = recode(Marital_status, "1" = "Married",
"2" = "Widowed",
"3" = "Divorced",
"4" = "Separated",
"5" = "Never_married",
"6" = "partner",
"77" = "Refused",
"99" = "Unknown"))
demo_subset_8_labeled <- demo_subset_8_labeled %>%
mutate(Family_income = recode(Family_income, "1" = "$0 - $4999",
"2" = "$5000 - $9999",
"3" = "$10000 - $14999",
"4" = "$15000 - $19999",
"5" = "$20000 - $24999",
"6" = "$25000 - $34999",
"7" = "$35000 - $44999",
"8" = "$45000 - $54999",
"9" = "$55000 - $64999",
"10" = "$65000 - $74999",
"12" = "$20000 and Over",
"13" = "Under $20000",
"14" = "$75000 - $99999",
"15" = "$100000 and Over",
"77" = "Refused",
"99" = "Unknown" ))
demo_subset_8_labeled$Family_income <- as.factor(demo_subset_8_labeled$Family_income)
#write.csv(demo_subset_8_labeled,file = "Data/Working/demo_subset_8_labeled.csv")
########################## Gender #############
Gender <- demo_subset_8_labeled %>%
group_by(Gender) %>%
summarize(count=n()) %>%
arrange(desc(count))
#Pie plot
Gender_plot <- ggplot(Gender, aes(x = "", y = round(100*count/sum(count), 1),
fill = reorder(Gender,count))) +
geom_bar(width = 1, stat = "identity", color = "white") +
coord_polar("y", start = 0)+
geom_text(aes(y = cumsum(100*count/sum(count)) - 0.5*(100*count/sum(count)),
label = paste(round(count/sum(count)*100),"%")), color = "black")+
ggtitle("Pie plot of Gender")+
scale_fill_grey(start = 0.8, end = 0.2,"Gender") + theme_void()
#ggsave(plot = Gender_plot, width = 3, height = 3, dpi = 300,
# filename = "Figures/Gender_plot.png")
########################## Country_of_birth #############
Country_of_birth <- demo_subset_8_labeled %>%
group_by(Country_of_birth) %>%
summarize(count=n()) %>%
arrange(desc(count))%>%
mutate(pct = count / sum(count),
pctlabel = paste0(round(pct*100), "%"),
lab.ypos = 100*cumsum(pct) - 0.5 *100*pct)
#Bar plot
require(scales)
Birth_plot <- ggplot(Country_of_birth, aes(x = reorder(Country_of_birth, -pct),y = pct)) +
geom_bar(stat = "identity", fill = "indianred3", color = "black") +
geom_text(aes(label = pctlabel), vjust = -0.25) +
scale_y_continuous(labels = percent) +
labs(x = "Country of birth", y = "percantage", title = "Bar Chart of Country of birth")
#ggsave(plot = Birth_plot, width = 3, height = 3, dpi = 300,
# filename = "Figures/Birth_plot.png")
######################## Marital_status #######################
Marital_status <- demo_subset_8_labeled %>%
group_by(Marital_status) %>%
summarize(count=n()) %>%
arrange(desc(count))%>%
mutate(pct = count / sum(count),
pctlabel = paste0(round(pct*100), "%"),
lab.ypos = 100*cumsum(pct) - 0.5 *100*pct)
#Bar plot
require(scales)
Marital_plot <- ggplot(Marital_status, aes(x = reorder(Marital_status, -pct),y = pct)) +
geom_bar(stat = "identity", fill = "indianred3", color = "black") +
geom_text(aes(label = pctlabel), vjust = -0.25) +
scale_y_continuous(labels = percent) +
labs(x = "Marital statush", y = "percantage", title = "Bar Chart of Marital status in US ") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
#ggsave(plot = Marital_plot, width = 3, height = 3, dpi = 300,
# filename = "Figures/Marital_plot.png")
######################## Race #######################
Race <- demo_subset_8_labeled %>%
group_by(Race) %>%
summarize(count=n()) %>%
arrange(desc(count))%>%
mutate(pct = count / sum(count),
pctlabel = paste0(round(pct*100), "%"),
lab.ypos = 100*cumsum(pct) - 0.5 *100*pct)
#Bar plot
require(scales)
Race_plot <- ggplot(Race, aes(x = reorder(Race, -pct),y = pct)) +
geom_bar(stat = "identity", fill = "indianred3", color = "black") +
geom_text(aes(label = pctlabel), vjust = -0.25) +
scale_y_continuous(labels = percent) +
labs(x = "Race", y = "percantage", title = "Bar Chart of Race in US ") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
#ggsave(plot = Race_plot, width = 3, height = 3, dpi = 300,
# filename = "Figures/Race_plot.png")
Our sample is fairly representative of the US population:
First, we remove all near-zero-variance features from the dataset, using a frequency-ratio cutoff of 45 (freqCut = 90/2):
diet_major <- diet
if (length(nearZeroVar(diet_major, freqCut = 90/2, uniqueCut = 10, saveMetrics = FALSE,
names = FALSE, foreach = FALSE, allowParallel = TRUE)) > 0){
diet_major <- diet_major[, -nearZeroVar(diet_major, freqCut = 90/2, uniqueCut = 10, saveMetrics = FALSE,
names = FALSE, foreach = FALSE, allowParallel = TRUE)]
}
Now, we will remove the features having more than 25% missing values, as decided before:
Null_Num_diet <- apply(diet_major, 2, function(x) length(which(x == "" | is.na(x) | x == "NA" | x == "-999" ))/length(x))
Null_Colms_diet <- colnames(diet_major)[Null_Num_diet > 0.25]
diet75 <- select(diet_major, -Null_Colms_diet)
colSums(is.na(diet75))
diet75 %>% summarise_all(~(sum(is.na(.))/n()*100))
We have selected the following 69 relevant columns among the 88 that have less than 25% of missing values:
We will now use our Dictionary to build a reference dataframe so we can distinguish the different variable types quickly and consistently:
diet_indexed <- diet75
colnames(diet_indexed) <- with(Dictionary,
Dictionary$Variable.Description[match(colnames(diet75),
Dictionary$Variable.Name,
nomatch = Dictionary$Variable.Name
)])
diet_Col_Labels <- data.frame("Code"=c(colnames(diet75)),
"Desp"=c(colnames(diet_indexed)))
Categorization of variables
We now enter the Factor/Numeric/“Computation not required” categorization in the generated Excel file:
* Enter codes only in the 3rd column, which should be named “Cat”.
* 0 = Factor requiring no computation.
* 1 = Numeric requiring computation.
* 2 = Factor requiring computation.
Reading the index back in:
Cat_diet <- c(0,1,1,2,2,2,1,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,2,2,2)
diet_Col_Labels <- data.frame(diet_Col_Labels,Cat = Cat_diet)
diet_Col_Labels = read.csv("Data/Labels/diet_Col_Labels.csv", header = TRUE, na.strings = c("NA","","#NA"))
diet_Col_Labels[, 2] <- sapply(diet_Col_Labels[, 2], as.character)
Now we prepare the dataset for imputation using this information.
Catcolmn_diet <- diet_Col_Labels[diet_Col_Labels$Cat ==2 , 2 ]
Numcolmn_diet <- diet_Col_Labels[diet_Col_Labels$Cat ==1 , 2 ]
Catcolmn_Nul_diet <- diet_Col_Labels[diet_Col_Labels$Cat ==0 , 2 ]
WorkingColm_diet <- c(Catcolmn_Nul_diet, Numcolmn_diet, Catcolmn_diet)
diet_selected = diet75[ WorkingColm_diet ]
init_diet = mice(diet_selected, maxit = 0)  # dry run to obtain default settings
meth_diet = init_diet$method
predM_diet = init_diet$predictorMatrix
predM_diet[, c("SEQN")]=0
meth_diet[Catcolmn_Nul_diet] = ""
meth_diet[Catcolmn_diet]="cart"
meth_diet[Numcolmn_diet]="pmm"
set.seed(256)
imputed_diet = mice(diet_selected, method=meth_diet, predictorMatrix=predM_diet, m=5)
#Create a dataset after imputation.
diet_imputed<- complete(imputed_diet)
####################################### Saving Impute
#write.csv(diet_imputed , "Data/Clean_Imputes/diet_imputed.csv",row.names = FALSE)
diet_imputed = read.csv("Data/Clean_Imputes/diet_imputed.csv", header = TRUE, na.strings = c("NA","","#NA"))
Labeling the dataset:
diet_labeled <- diet_imputed
rm(diet_imputed)
diet_labeled <- diet_labeled %>%
dplyr::rename("ID" = "SEQN",
"Carbs_diet" = "DR1TCARB",
"Sugar_diet" = "DR1TSUGR",
"Fiber_diet" = "DR1TFIBE",
"transfat_diet" = "DR1TTFAT",
"satfat_diet" = "DR1TSFAT",
"zinc_diet" = "DR1TZINC",
"copper_diet" = "DR1TCOPP",
"sodium_diet" = "DR1TSODI",
"pota_diet" = "DR1TPOTA",
"selenium_diet" = "DR1TSELE" )
First, we remove all near-zero-variance features from the dataset, using a frequency-ratio cutoff of 45 (freqCut = 90/2):
exam_major <- examination
rm(examination)
if (length(nearZeroVar(exam_major, freqCut = 90/2, uniqueCut = 10, saveMetrics = FALSE,
names = FALSE, foreach = FALSE, allowParallel = TRUE)) > 0){
exam_major <- exam_major[, -nearZeroVar(exam_major, freqCut = 90/2, uniqueCut = 10, saveMetrics = FALSE,
names = FALSE, foreach = FALSE, allowParallel = TRUE)]
}
Now, we will remove the features having more than 25% missing values, as decided before:
Null_Num_exam <- apply(exam_major, 2, function(x) length(which(x == "" | is.na(x) | x == "NA" | x == "-999" ))/length(x))
Null_Colms_exam <- colnames(exam_major)[Null_Num_exam > 0.25]
exam75 <- select(exam_major, -Null_Colms_exam)
colSums(is.na(exam75))
exam75 %>% summarise_all(~(sum(is.na(.))/n()*100))
We have selected the following 12 relevant columns among the 105 that have less than 25% of missing values:
We will now use our Dictionary to build a reference dataframe so we can distinguish the different variable types quickly and consistently:
exam_indexed <- exam75
colnames(exam_indexed) <- with(Dictionary,
Dictionary$Variable.Description[match(colnames(exam75),
Dictionary$Variable.Name,
nomatch = Dictionary$Variable.Name
)])
exam_Col_Labels <- data.frame("Code"=c(colnames(exam75)),
"Desp"=c(colnames(exam_indexed)))
Categorization of variables
We now enter the Factor/Numeric/“Computation not required” categorization in the generated Excel file:
* Enter codes only in the 3rd column, which should be named “Cat”.
* 0 = Factor requiring no computation.
* 1 = Numeric requiring computation.
* 2 = Factor requiring computation.
Reading the index back in:
Cat_exam <- c(0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1)
exam_Col_Labels <- data.frame(exam_Col_Labels,Cat = Cat_exam)
Now we prepare the dataset for imputation using this information.
Catcolmn_exam <- exam_Col_Labels[exam_Col_Labels$Cat ==2 , 2 ]
Numcolmn_exam <- exam_Col_Labels[exam_Col_Labels$Cat ==1 , 2 ]
Catcolmn_Nul_exam <- exam_Col_Labels[exam_Col_Labels$Cat ==0 , 2 ]
WorkingColm_exam <- c(Catcolmn_Nul_exam, Numcolmn_exam, Catcolmn_exam)
exam_selected = exam75[ WorkingColm_exam ]
init_exam = mice(exam_selected, maxit = 0)  # dry run to obtain default settings
meth_exam = init_exam$method
predM_exam = init_exam$predictorMatrix
predM_exam[, c("SEQN")]=0
meth_exam[Catcolmn_Nul_exam] = ""
meth_exam[Catcolmn_exam]="cart"
meth_exam[Numcolmn_exam]="pmm"
set.seed(311)
imputed_exam = mice(exam_selected, method=meth_exam, predictorMatrix=predM_exam, m=5)
#Create a dataset after imputation.
exam_imputed<- complete(imputed_exam)
rm(imputed_exam)
####################################### Saving Impute
#write.csv(exam_imputed , "Data/Clean_Imputes/exam_imputed.csv",row.names = FALSE)
exam_imputed = read.csv("Data/Clean_Imputes/exam_imputed.csv", header = TRUE, na.strings = c("NA","","#NA"))
Labeling the dataset:
exam_labeled <- exam_imputed
rm(exam_imputed)
exam_labeled = dplyr::rename(
exam_labeled,
"ID" = "SEQN",
"BP_test_time_exam" = "PEASCTM1",
"BP_arm_exam" = "BPAARM",
"BP_Systolic_exam" = "BPXSY2",
"BP_Diastolic_exam" = "BPXDI2",
"Weight_exam" = "BMXWT",
"Height_exam" = "BMXHT",
"Leg_length_exam" = "BMXBMI",
"Arm_length_exam" = "BMXLEG",
"Waist_circumference_exam" = "BMXWAIST",
"Dominant_hand_exam" = "MGD130",
"Grip_strength_exam" = "MGDCGSZ"
)
exam_labeled = mutate(
exam_labeled,
BP_arm_exam = recode(BP_arm_exam,
"1" = "Left",
"2" = "Right"),
Dominant_hand_exam = recode(Dominant_hand_exam,
"1"="Right",
"2"="Left",
"3"="Neither")
)
exam_labeled[ , 70:97] <- lapply(exam_labeled[ , 70:97], function(x)
  dplyr::recode(x, "1" = "D", "2" = "E", "3" = "J", "4" = "K", "5" = "M",
                "6" = "P", "7" = "Q", "8" = "R", "9" = "S", "10" = "T",
                "11" = "U", "12" = "X", "13" = "Y", "14" = "Z"))
First, we remove all near-zero-variance features from the dataset, using a frequency-ratio cutoff of 45 (freqCut = 90/2):
labsdata_major <- labs
if (length(nearZeroVar(labsdata_major, freqCut = 90/2, uniqueCut = 10, saveMetrics = FALSE,
names = FALSE, foreach = FALSE, allowParallel = TRUE)) > 0){
labsdata_major <- labsdata_major[, -nearZeroVar(labsdata_major, freqCut = 90/2, uniqueCut = 10, saveMetrics = FALSE,
names = FALSE, foreach = FALSE, allowParallel = TRUE)]
}
Now, we will remove the features having more than 25% missing values, as decided before:
Null_Num_labsdata <- apply(labsdata_major, 2, function(x) length(which(x == "" | is.na(x) | x == "NA" | x == "-999" ))/length(x))
Null_Colms_labsdata <- colnames(labsdata_major)[Null_Num_labsdata > 0.25]
labsdata75 <- select(labsdata_major, -Null_Colms_labsdata)
We have selected the following 9 relevant columns among the 46 that have less than 25% of missing values:
We will now use our Dictionary to build a reference dataframe so we can distinguish the different variable types quickly and consistently:
labsdata_indexed <- labsdata75
colnames(labsdata_indexed) <- with(Dictionary,
Dictionary$Variable.Description[match(colnames(labsdata75),
Dictionary$Variable.Name,
nomatch = Dictionary$Variable.Name
)])
labsdata_Col_Labels <- data.frame("Code"=c(colnames(labsdata75)),
"Desp"=c(colnames(labsdata_indexed)))
Categorization of variables
We now enter the Factor/Numeric/“Computation not required” categorization in the generated Excel file:
* Enter codes only in the 3rd column, which should be named “Cat”.
* 0 = Factor requiring no computation.
* 1 = Numeric requiring computation.
* 2 = Factor requiring computation.
Reading the index back in:
Cat_labs <- c(0,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
2,2,2,2,1,1,2,2,2,2,1,1,1,1,1)
labsdata_Col_Labels <- data.frame(labsdata_Col_Labels,Cat = Cat_labs)
Now we prepare the dataset for imputation using this information.
Catcolmn_labsdata <- labsdata_Col_Labels[labsdata_Col_Labels$Cat ==2 , 2 ]
Numcolmn_labsdata <- labsdata_Col_Labels[labsdata_Col_Labels$Cat ==1 , 2 ]
Catcolmn_Nul_labsdata <- labsdata_Col_Labels[labsdata_Col_Labels$Cat ==0 , 2 ]
WorkingColm_labsdata <- c(Catcolmn_Nul_labsdata, Numcolmn_labsdata, Catcolmn_labsdata)
labsdata_selected = labsdata75[ WorkingColm_labsdata ]
init_labsdata = mice(labsdata_selected, maxit = 0)  # dry run to obtain default settings
meth_labsdata = init_labsdata$method
predM_labsdata = init_labsdata$predictorMatrix
predM_labsdata[, c("SEQN")]=0
meth_labsdata[Catcolmn_Nul_labsdata] = ""
meth_labsdata[Catcolmn_labsdata]="cart"
meth_labsdata[Numcolmn_labsdata]="pmm"
set.seed(415)
imputed_labsdata = mice(labsdata_selected, method=meth_labsdata, predictorMatrix=predM_labsdata, m=5)
labsdata_imputed<- complete(imputed_labsdata)
rm(imputed_labsdata)
#Check for missings in the imputed dataset.
sapply(labsdata_imputed, function(x) sum(is.na(x)))
####################################### Saving Impute
#write.csv(labsdata_imputed , "Data/Clean_Imputes/labsdata_imputed.csv",row.names = FALSE)
labsdata_imputed = read.csv("Data/Clean_Imputes/labsdata_imputed.csv", header = TRUE, na.strings = c("NA","","#NA"))
Labeling the dataset:
labs_labeled <- labsdata_imputed %>%
dplyr::rename("ID" = "SEQN",
"White_blood_cells_labs" = "LBXWBCSI",
"Red_bloods_cells_labs" = "LBXRBCSI",
"Caffeine_labs" = "PHQ020",
"Alcohol_labs" = "PHQ030",
"Supplements_labs" = "PHQ060",
"Hepatitis_a_labs" = "LBXHA",
"Hepatitis_b_labs" = "LBXHBC",
"Cholesterol_labs" = "LBXTC" )
labs_labeled = labs_labeled %>%
mutate(Caffeine_labs = recode(Caffeine_labs, "1" = "Yes",
"2" = "No",
.missing = "Not Tested"))
labs_labeled = labs_labeled %>%
mutate(Alcohol_labs = recode(Alcohol_labs, "1" = "Yes",
"2" = "No",
.missing = "Not Tested"))
labs_labeled = labs_labeled %>%
mutate(Supplements_labs = recode(Supplements_labs, "1" = "Yes",
"2" = "No",
.missing = "Not Tested"))
labs_labeled = labs_labeled %>%
mutate(Hepatitis_a_labs = recode(Hepatitis_a_labs, "1" = "Positive",
"2" = "Negative",
"3" = "Indeterminate",
.missing = "Not Tested"))
labs_labeled = labs_labeled %>%
mutate(Hepatitis_b_labs = recode(Hepatitis_b_labs, "1" = "Positive",
"2" = "Negative",
.missing = "Not Tested"))
First, we remove all near-zero-variance features from the dataset, using a frequency-ratio cutoff of 45 (freqCut = 90/2):
medsdata_major <- medications
if (length(nearZeroVar(medsdata_major, freqCut = 90/2, uniqueCut = 10, saveMetrics = FALSE,
names = FALSE, foreach = FALSE, allowParallel = TRUE)) > 0){
medsdata_major <- medsdata_major[, -nearZeroVar(medsdata_major, freqCut = 90/2, uniqueCut = 10, saveMetrics = FALSE,
names = FALSE, foreach = FALSE, allowParallel = TRUE)]
}
Now, we will remove the features having more than 32% missing values (a relaxed threshold, since the medications data is sparser):
Null_Num_medsdata <- apply(medsdata_major, 2, function(x) length(which(x == "" | is.na(x) | x == "NA" | x == "-999" ))/length(x))
Null_Colms_medsdata <- colnames(medsdata_major)[Null_Num_medsdata > 0.33]
medsdata68 <- select(medsdata_major, -all_of(Null_Colms_medsdata))
Most of the columns had more than 33% missing values. Among the 8 columns below that threshold, we selected the following 5 relevant columns:
We will now refer to our Dictionary to build a reference dataframe that lets us distinguish the different forms of variables quickly and reliably:
medsdata_indexed <- medsdata68
colnames(medsdata_indexed) <- with(Dictionary,
Dictionary$Variable.Description[match(colnames(medsdata68),
Dictionary$Variable.Name,
nomatch = Dictionary$Variable.Name
)])
medsdata_Col_Labels <- data.frame("Code"=c(colnames(medsdata68)),
"Desp"=c(colnames(medsdata_indexed)))
Categorization of variables
We now have to enter the Factor/Numeric/‘Computation not required’ categorization in the generated Excel file:
* Only to be done in the 3rd column.
* The codes are:
* 0 = Factor requiring no computation.
* 1 = Numeric requiring computation.
* 2 = Factor requiring computation.
* Please name the column for the category “Cat”.
Reading Index again
Cat_meds <- c(0,2,2,2,2,1,2,2,1)
Cat_meds
medsdata_Col_Labels <- data.frame(medsdata_Col_Labels,Cat = Cat_meds)
write.csv(medsdata_Col_Labels,file = "Data/Labels/medsdata_Col_Labels.csv")
medsdata_Col_Labels = read.csv("Data/Labels/medsdata_Col_Labels.csv", header = TRUE, na.strings = c("NA","","#NA"))
medsdata_Col_Labels[, 2] <- sapply(medsdata_Col_Labels[, 2], as.character)
Now we prepare the dataset for imputation using all of this information.
Catcolmn_medsdata <- medsdata_Col_Labels[medsdata_Col_Labels$Cat ==2 , 2 ]
Numcolmn_medsdata <- medsdata_Col_Labels[medsdata_Col_Labels$Cat ==1 , 2 ]
Catcolmn_Nul_medsdata <- medsdata_Col_Labels[medsdata_Col_Labels$Cat ==0 , 2 ]
WorkingColm_medsdata <- c(Catcolmn_Nul_medsdata, Numcolmn_medsdata, Catcolmn_medsdata)
medsdata_selected = medsdata68[ WorkingColm_medsdata ]
init_medsdata = mice(medsdata_selected, maxit=0)
meth_medsdata = init_medsdata$method
predM_medsdata = init_medsdata$predictorMatrix
predM_medsdata[, c("SEQN")]=0
meth_medsdata[Catcolmn_Nul_medsdata] = ""
meth_medsdata[Catcolmn_medsdata]="rf"
meth_medsdata[Numcolmn_medsdata]="pmm"
set.seed(256)
imputed_medsdata = mice(medsdata_selected, method=meth_medsdata, predictorMatrix=predM_medsdata, m=5)
medsdata_imputed<- complete(imputed_medsdata)
# Saving Impute
write.csv(medsdata_imputed , "Data/Working/medsdata_imputed.csv")
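One optional way to sanity-check the imputation, assuming the mice package's lattice-based plotting method, is to compare the distributions of observed and imputed values:

```r
library(mice)
# Overlays the densities of observed (blue) and imputed (red) values for the
# numeric columns imputed with pmm; large divergences deserve a closer look.
densityplot(imputed_medsdata)
```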
Labeling the dataset:
meds_subset_labelled <- medsdata_imputed_subset
colnames(meds_subset_labelled) <- with(Dictionary,
Dictionary$Variable.Description[match(colnames(medsdata_imputed_subset),
Dictionary$Variable.Name,
nomatch = Dictionary$Variable.Name
)])
str(meds_subset_labelled)
write.csv(meds_subset_labelled,file = "meds_subset_labelled.csv")
First, we will remove the near-zero-variance variables.
ques_data_major <- questionnaire
if (length(nearZeroVar(ques_data_major, freqCut = 90/2, uniqueCut = 10, saveMetrics = FALSE,
names = FALSE, foreach = FALSE, allowParallel = TRUE)) > 0){
ques_data_major <- ques_data_major[, -nearZeroVar(ques_data_major, freqCut = 90/2, uniqueCut = 10, saveMetrics = FALSE,
names = FALSE, foreach = FALSE, allowParallel = TRUE)]
}
Now, we will remove the features having more than 25% missing values, as decided before:
Null_Num_ques_data <- apply(ques_data_major, 2, function(x) length(which(x == "" | is.na(x) | x == "NA" | x == "-999" ))/length(x))
Null_Colms_ques_data <- colnames(ques_data_major)[Null_Num_ques_data > 0.25]
ques_data75 <- select(ques_data_major, -all_of(Null_Colms_ques_data))
colSums(is.na(ques_data75))
ques_data75 %>% summarise_all(~(sum(is.na(.))/n()*100))
Among the 79 columns with less than 25% missing values, we selected the following 40 relevant columns:
* ID (SEQN) - Respondent sequence number.
* Spent_total (CBD070) - The next questions are about how much money {your family spends/you spend} on food. First I’ll ask you about money spent at supermarkets or grocery stores. Then we will talk about money spent at other types of stores. During the past 30 days, how much money {did your family/did you} spend at supermarkets or grocery stores? Please include purchases made with food stamps.
* Spent_groceries (CBD110) - About how much money {did your family/did you} spend on food at these types of stores? (Please do not include any stores you have already told me about.)
* Spent_vending (CBD120) - During the past 30 days, how much money {did your family/did you} spend on eating out? Please include money spent in cafeterias at work or at school or on vending machines, for all family members.
* Spent_delivered (CBD130) - During the past 30 days, how much money {did your family/did you} spend on food carried out or delivered? Please do not include money you have already told me about.
* Cold_30 (HSQ500) - Did {you/SP} have a head cold or chest cold that started during those 30 days?
* Stomach_30 (HSQ510) - Did {you/SP} have a stomach or intestinal illness with vomiting or diarrhea that started during those 30 days?
* Flu_30 (HSQ520) - Did {you/SP} have flu, pneumonia, or ear infections that started during those 30 days?
* Diabetes (DIQ010) - The next questions are about specific medical conditions. {Other than during pregnancy, {have you/has SP}/{Have you/Has SP}} ever been told by a doctor or health professional that {you have/{he/she/SP} has} diabetes or sugar diabetes?
* Taking_insuline (DIQ050) - {Is SP/Are you} now taking insulin?
* Milk_30 (DBQ197) - Now I’m going to ask a few questions about milk products. Do not include their use in cooking. In the past 30 days, how often did {you/SP} have milk to drink or on {your/his/her} cereal? Please include chocolate and other flavored milks as well as hot cocoa made with milk. Do not count small amounts of milk added to coffee or tea. Would you say…
* Meals_outside (DBD895) - Next I’m going to ask you about meals. By meal, I mean breakfast, lunch and dinner. During the past 7 days, how many meals {did you/did SP} get that were prepared away from home in places such as restaurants, fast food places, food stands, grocery stores, or from vending machines? {Please do not include meals provided as part of the school lunch or school breakfast./Please do not include meals provided as part of the community programs you reported earlier.}
* Meals_premade (DBD905) - Some grocery stores sell “ready to eat” foods such as salads, soups, chicken, sandwiches and cooked vegetables in their salad bars and deli counters. During the past 30 days, how often did {you/SP} eat “ready to eat” foods from the grocery store? Please do not include sliced meat or cheese you buy for sandwiches and frozen or canned foods.
* Meals_frozen (DBD910) - During the past 30 days, how often did you {SP} eat frozen meals or frozen pizzas? Here are some examples of frozen meals and frozen pizzas.
* Deafness (DLQ010) - With this next set of questions, we want to learn about people who have physical, mental, or emotional conditions that cause serious difficulties with their daily activities. Though different, these questions may sound similar to ones I asked earlier. {Are you/Is SP} deaf or {do you/does he/does she} have serious difficulty hearing?
* Blindness (DLQ020) - {Are you/Is SP} blind or {do you/does he/does she} have serious difficulty seeing even when wearing glasses?
* Forgetfulness (DLQ040) - Because of a physical, mental, or emotional condition, {do you/does he/does she} have serious difficulty concentrating, remembering, or making decisions?
* Food_assistance (FSD151) - In the last 12 months, did {you/you or any member of your household} ever get emergency food from a church, a food pantry, or a food bank, or eat in a soup kitchen?
* WIC_assistance (FSQ162) - In the last 12 months, did {you/you or any member of your household} receive benefits from the WIC program, that is, the Women, Infants and Children program?
* Hepatitis_b (HEQ010) - Has a doctor or other health professional ever told {you/SP} that {you have/s/he/SP has} Hepatitis B? (Hepatitis is a form of liver disease. Hepatitis B is an infection of the liver from the Hepatitis B virus (HBV).)
* Hepatitis_c (HEQ030) - Has a doctor or other health professional ever told {you/SP} that {you have/s/he/SP has} Hepatitis C? (Hepatitis is a form of liver disease. Hepatitis C is an infection of the liver from the Hepatitis C virus (HCV).)
* Insurance_current (HIQ011) - The (first/next) questions are about health insurance. {Are you/Is SP} covered by health insurance or some other kind of health care plan? [Include health insurance obtained through employment or purchased directly as well as government programs like Medicare and Medicaid that provide medical care or help pay medical bills.]
* Insurance_lapse_12 (HIQ210) - In the past 12 months, was there any time when {you/SP} did not have any health insurance coverage?
* House_rooms (HOD050) - How many rooms are in this home? Count the kitchen but not the bathroom.
* Health_current (HUQ010) - {First/Next} I have some general questions about {your/SP’s} health. Would you say {your/SP’s} health in general is . . .
* Health_institution (HUQ041) - {What kind of place is it - a clinic, doctor’s office, emergency room, or some other place?} {What kind of place {do you/does SP} go to most often - a clinic, doctor’s office, emergency room, or some other place?}
* Doctor_visits_12 (HUQ051) - {During the past 12 months, how/How} many times {have you/has SP} seen a doctor or other health care professional about {your/his/her} health at a doctor’s office, a clinic or some other place? Do not include times {you were/s/he was} hospitalized overnight, visits to hospital emergency rooms, home visits or telephone calls.
* Health_mental_12 (HUQ090) - During the past 12 months, that is since {DISPLAY CURRENT MONTH} of {DISPLAY LAST YEAR}, {have you/has SP} seen or talked to a mental health professional such as a psychologist, psychiatrist, psychiatric nurse or clinical social worker about {your/his/her} health?
* Family_income_mo (IND235) - Monthly family income (reported as a range value in dollars).
* Asthma (MCQ010) - The following questions are about different medical conditions. Has a doctor or other health professional ever told {you/SP} that {you have/s/he/SP has} asthma (az-ma)?
* Anemia (MCQ053) - During the past 3 months, {have you/has SP} been on treatment for anemia (a-nee-me-a), sometimes called “tired blood” or “low blood”? [Include diet, iron pills, iron shots, transfusions as treatment.]
* Celiac (MCQ082) - Has a doctor or other health professional ever told {you/SP} that {you have/s/he/SP has} celiac (sele-ak) disease, also called sprue (sproo)?
* Gluten_free_diet (MCQ086) - {Are you/is SP} on a gluten-free diet?
* Jaundice (MCQ203) - Has anyone ever told {you/SP} that {you/she/he/SP} had yellow skin, yellow eyes or jaundice? Please do not include infant jaundice, which is common during the first weeks after birth.
* Asthma_relatives (MCQ300B) - Including living and deceased, were any of {SP’s/your} close biological relatives, that is, blood relatives including father, mother, sisters or brothers, ever told by a health professional that they had asthma (az-ma)?
* Dentist_visit_since (OHQ030) - The next questions are about {your/SP’s} teeth and gums. About how long has it been since {you/SP} last visited a dentist? Include all types of dentists, such as orthodontists, oral surgeons, and all other dental specialists, as well as dental hygienists.
* TV_30 (PAQ710) - Now I will ask you first about TV watching and then about computer use. Over the past 30 days, on average how many hours per day did {you/SP} sit and watch TV or videos? Would you say . . .
* Gaming_hours (PAQ715) - Over the past 30 days, on average how many hours per day did {you/SP} use a computer or play computer games outside of school? Include Playstation, Nintendo DS, or other portable video games. Would you say . . .
* Smoking_relatives (SMD460) - Now I would like to ask you a few questions about smoking in this home. How many people who live here smoke cigarettes, cigars, little cigars, pipes, water pipes, hookah, or any other tobacco product?
* Ride_motor_vehicle (SMQ870) - During the last 7 days, did {you/SP} ride in a car or motor vehicle?
We will now refer to our Dictionary to build a reference dataframe that lets us distinguish the different forms of variables quickly and reliably:
ques_data_indexed <- ques_data75
colnames(ques_data_indexed) <- with(Dictionary,
Dictionary$Variable.Description[match(colnames(ques_data75),
Dictionary$Variable.Name,
nomatch = Dictionary$Variable.Name
)])
ques_data_Col_Labels <- data.frame("Code"=c(colnames(ques_data75)),
"Desp"=c(colnames(ques_data_indexed)))
#dir.create("Data/Labels")
write.csv(ques_data_Col_Labels,file = "Data/Labels/ques_data_Col_Labels.csv")
Categorization of variables
We now have to enter the Factor/Numeric/‘Computation not required’ categorization in the generated Excel file:
* Only to be done in the 3rd column.
* The codes are:
* 0 = Factor requiring no computation.
* 1 = Numeric requiring computation.
* 2 = Factor requiring computation.
* Please name the column for the category “Cat”.
Reading Index again
# Categorization of variables
Cat_ques <- c(0,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2)
ques_data_Col_Labels <- data.frame(ques_data_Col_Labels,Cat = Cat_ques)
write.csv(ques_data_Col_Labels,file = "Data/Labels/ques_data_Col_Labels.csv")
ques_data_Col_Labels = read.csv("Data/Labels/ques_data_Col_Labels.csv", header = TRUE, na.strings = c("NA","","#NA"))
ques_data_Col_Labels[, 2] <- sapply(ques_data_Col_Labels[, 2], as.character)
Now we prepare the dataset for imputation using all of this information.
Catcolmn_ques_data <- ques_data_Col_Labels[ques_data_Col_Labels$Cat ==2 , 2 ]
Numcolmn_ques_data <- ques_data_Col_Labels[ques_data_Col_Labels$Cat ==1 , 2 ]
Catcolmn_Nul_ques_data <- ques_data_Col_Labels[ques_data_Col_Labels$Cat ==0 , 2 ]
WorkingColm_ques_data <- c(Catcolmn_Nul_ques_data, Numcolmn_ques_data, Catcolmn_ques_data)
ques_data_selected = ques_data75[ WorkingColm_ques_data ]
ques_data_selected[, Catcolmn_ques_data] <- sapply(ques_data_selected[, Catcolmn_ques_data], as.numeric)
ques_data_selected[, Catcolmn_Nul_ques_data] <- sapply(ques_data_selected[, Catcolmn_Nul_ques_data], as.factor)
ques_data_selected[, Numcolmn_ques_data] <- sapply(ques_data_selected[, Numcolmn_ques_data], as.numeric)
init_ques_data = mice(ques_data_selected, maxit=0)
meth_ques_data = init_ques_data$method
predM_ques_data = init_ques_data$predictorMatrix
predM_ques_data[, c("SEQN")]=0
meth_ques_data[Catcolmn_Nul_ques_data] = ""
meth_ques_data[Catcolmn_ques_data]="cart"
meth_ques_data[Numcolmn_ques_data]="pmm"
set.seed(415)
imputed_ques_data = mice(ques_data_selected, method=meth_ques_data, predictorMatrix=predM_ques_data, m=5)
ques_data_imputed<- complete(imputed_ques_data)
write.csv(ques_data_imputed , "Data/Working/ques_data_imputed.csv")
Now we label and save the data set:
ques_Yes_No_NO_SEQN <- c("HSQ500","HSQ510","HSQ520","DIQ010","DIQ050","DLQ010","DLQ020","DLQ040","FSD151","FSQ162","HIQ011","HIQ210","HUQ090","MCQ010","MCQ053","MCQ300B","SMQ870")
ques_data_imputed_subset[ , ques_Yes_No_NO_SEQN ][ ques_data_imputed_subset[ , ques_Yes_No_NO_SEQN ] == "1" ] <- "Yes"
ques_data_imputed_subset[ , ques_Yes_No_NO_SEQN ][ ques_data_imputed_subset[ , ques_Yes_No_NO_SEQN ] == "2" ] <- "No"
ques_data_imputed_subset[ , ques_Yes_No_NO_SEQN ][ ques_data_imputed_subset[ , ques_Yes_No_NO_SEQN ] == "7" ] <- "Refused"
ques_data_imputed_subset[ , ques_Yes_No_NO_SEQN ][ ques_data_imputed_subset[ , ques_Yes_No_NO_SEQN ] == "9" ] <- "Unknown"
ques_data_numeric1 <-c("CBD070", "CBD110","CBD120","CBD130")
ques_data_imputed_subset[ , ques_data_numeric1 ][ ques_data_imputed_subset[ , ques_data_numeric1 ] == "777777" ] <- "Refused"
ques_data_imputed_subset[ , ques_data_numeric1 ][ ques_data_imputed_subset[ , ques_data_numeric1 ] == "999999" ] <- "Unknown"
ques_data_numeric2 <-c("DBD895", "DBD905","DBD910","CBD130")
ques_data_imputed_subset[ , ques_data_numeric2 ][ ques_data_imputed_subset[ , ques_data_numeric2 ] == "0" ] <- "None"
ques_data_imputed_subset[ , ques_data_numeric2 ][ ques_data_imputed_subset[ , ques_data_numeric2 ] == "7777" ] <- "Refused"
ques_data_imputed_subset[ , ques_data_numeric2 ][ ques_data_imputed_subset[ , ques_data_numeric2 ] == "9999" ] <- "Unknown"
ques_data_imputed_subset[ , "DBD895" ][ ques_data_imputed_subset[ , "DBD895" ] == "5555" ] <- "More than 21 meals per week"
ques_data_imputed_subset <- ques_data_imputed_subset %>%
mutate(HUQ010 = recode(HUQ010 ,
"1" = "Excellent" ,
"2" = "Very good" ,
"3"= "Good" ,
"4"= "Fair" ,
"5" = "Poor" ,
"7"= "Refused" ,
"9"= "Unknown"))
ques_data_imputed_subset <- ques_data_imputed_subset %>%
mutate(DBQ197 = recode(DBQ197 ,
"0"= "Never",
"1"= "Rarely-less than once a week",
"2"= "Sometimes-once a week or more, but less than once a day",
"3"= "Often-once a day or more?",
"4"= "Varied",
"7"= "Refused",
"9"= "Unknown"))
ques_data_imputed_subset <- ques_data_imputed_subset %>%
mutate(HUQ041 = recode(HUQ041 ,
"1"= "Clinic or health center",
"2"= "Doctor's office or HMO",
"3"= "Hospital emergency room",
"4"= "Hospital outpatient department",
"5"= "Some other place",
"6"= "Doesn't go to one place most often",
"77"= "Refused",
"99"= "Unknown"))
ques_data_imputed_subset <- ques_data_imputed_subset %>%
mutate(HUQ051 = recode(HUQ051 ,
"0"= "None",
"1"= "1",
"2"= "2 to 3",
"3"= "4 to 5",
"4"= "6 to 7",
"5"= "8 to 9",
"6"= "10 to 12",
"7"= "13 to 15",
"8"= "16 or more",
"77"= "Refused",
"99"= "Unknown"))
ques_data_imputed_subset <- ques_data_imputed_subset %>%
mutate(IND235 = recode(IND235 ,
"1"= "$0 - $399",
"2"= "$400 - $799",
"3"= "$800 - $1249",
"4"= "$1250 - $1649",
"5"= "$1650 - $2099",
"6"= "$2100 - $2899",
"7"= "$2900 - $3749",
"8"= "$3750 - $4599",
"9"= "$4600 - $5399",
"10"= "$5400 - $6249",
"11"= "$6250 - $8399",
"12"= "$8400 and over",
"77"= "Refused",
"99"= "Unknown"))
ques_data_imputed_subset <- ques_data_imputed_subset %>%
mutate(OHQ030 = recode(OHQ030 ,
"1"= "6 months or less",
"2"= "More than 6 months, but not more than 1 year ago",
"3"= "More than 1 year, but not more than 2 years ago",
"4"= "More than 2 years, but not more than 3 years ago",
"5"= "More than 3 years, but not more than 5 years ago",
"6"= "More than 5 years ago",
"7"= "Never have been",
"77"= "Refused",
"99"= "Unknown" ))
ques_data_imputed_subset <- ques_data_imputed_subset %>%
mutate(PAQ710 = recode(PAQ710 ,
"0"= "Less than 1 hour",
"1"= "1 hour",
"2"= "2 hours",
"3"= "3 hours",
"4"= "4 hours",
"5"= "5 hours or more",
"8"= "{You don't/SP does not} watch TV or videos",
"77"= "Refused",
"99"= "Unknown" ))
ques_data_imputed_subset <- ques_data_imputed_subset %>%
mutate(PAQ715 = recode(PAQ715 ,
"0"= "Less than 1 hour",
"1"= "1 hour",
"2"= "2 hours",
"3"= "3 hours",
"4"= "4 hours",
"5"= "5 hours or more",
"8"= "{you do not/SP does not} use a computer outside of school",
"77"= "Refused",
"99"= "Unknown" ))
ques_data_imputed_subset <- ques_data_imputed_subset %>%
mutate(SMD460 = recode(SMD460 ,
"0"= "No one in household is a smoker",
"1"= "1 household member is a smoker",
"2"= "2 household members are smokers",
"3"= "3 or more household members are smokers",
"777"= "Refused",
"999"= "Unknown"))
ques_data_imputed_subset <- ques_data_imputed_subset %>%
mutate(HOD050 = recode(HOD050 ,
"1"= "1",
"2"= "2",
"3"= "3",
"4"= "4",
"5"= "5",
"6"= "6",
"7"= "7",
"8"= "8",
"9"= "9",
"10"= "10",
"11"= "11",
"12"= "12",
"13"= "13 or more",
"777"= "Refused",
"999"= "Unknown" ))
ques_subset_labelled <- ques_data_imputed_subset
colnames(ques_subset_labelled) <- with(Dictionary,
Dictionary$Variable.Description[match(colnames(ques_data_imputed_subset),
Dictionary$Variable.Name,
nomatch = Dictionary$Variable.Name
)])
write.csv(ques_subset_labelled,file = "Data/Working/ques_subset_labelled.csv")
Perform visualization against the clean datasets and the union of the cleaned datasets
Visuals against the cleaned dataset
demo_subset_8 = read.csv("Data/Working/demo_subset_8_imputed.csv", header = TRUE, na.strings = c("NA","","#NA"))[-1]
target_disease_dataset = read.csv("Data/Working/target_disease_dataset.csv", header = TRUE, na.strings = c("NA","","#NA"))[-1]
demographic_imputed = read.csv("Data/Clean_Imputes/demographic_imputed.csv", header = TRUE, na.strings = c("NA","","#NA"))
library(devtools)
#install_github("vqv/ggbiplot")
library(ggbiplot)
demo_subset_8.pca <- prcomp(demo_subset_8[,c(2:9)], center = TRUE,scale = TRUE)
summary(demo_subset_8.pca)
str(demo_subset_8.pca)
#ggbiplot(demo_subset_8.pca)
screeplot(demo_subset_8.pca, type = "l", npcs = 8, main = "Screeplot of the 8 PCs")
abline(h = 1, col="red", lty=5)
legend("topright", legend=c("Eigenvalue = 1"),
col=c("red"), lty=5, cex=0.6)
cumpro <- cumsum(demo_subset_8.pca$sdev^2 / sum(demo_subset_8.pca$sdev^2))
plot(cumpro[0:8], xlab = "PC #", ylab = "Amount of explained variance", main = "Cumulative variance plot")
abline(v = 4, col="blue", lty=5)
abline(h = 0.5934, col="blue", lty=5)
legend("topleft", legend=c("Cut-off @ PC4"),
col=c("blue"), lty=5, cex=0.6)
From the above graphs, we notice that the first 4 components each have an eigenvalue > 1 and together explain almost 60% of the variance. We cannot effectively reduce dimensionality from 8 to 4, because we would lose about 40% of the variance!
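To make the eigenvalue criterion explicit, the component variances can be read directly off the prcomp object; this is a small optional check, not part of the original pipeline:

```r
# Eigenvalues are the squared standard deviations of the components.
eigenvalues <- demo_subset_8.pca$sdev^2
which(eigenvalues > 1)                    # components retained under the eigenvalue-above-1 rule
sum(eigenvalues[1:4]) / sum(eigenvalues)  # share of variance kept with 4 PCs
```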
library("factoextra")
fviz_pca_ind(demo_subset_8.pca, geom.ind = "point", pointshape = 21,
pointsize = 2,
fill.ind = target_disease_dataset$HAS_DIABETES,
col.ind = "black",
palette = "jco",
addEllipses = TRUE,
label = "var",
col.var = "black",
repel = TRUE,
legend.title = "HAS_DIABETES") +
ggtitle("2D PCA-plot from 8 feature dataset") +
theme(plot.title = element_text(hjust = 0.5))
fviz_pca_ind(demo_subset_8.pca, geom.ind = "point", pointshape = 21,
pointsize = 2,
fill.ind = target_disease_dataset$HAS_HYPERTENSION,
col.ind = "black",
palette = "jco",
addEllipses = TRUE,
label = "var",
col.var = "black",
repel = TRUE,
legend.title = "HAS_HYPERTENSION") +
ggtitle("2D PCA-plot from 8 feature dataset") +
theme(plot.title = element_text(hjust = 0.5))
fviz_pca_ind(demo_subset_8.pca, geom.ind = "point", pointshape = 21,
pointsize = 2,
fill.ind = target_disease_dataset$HAS_CANCER,
col.ind = "black",
palette = "jco",
addEllipses = TRUE,
label = "var",
col.var = "black",
repel = TRUE,
legend.title = "HAS_CANCER") +
ggtitle("2D PCA-plot from 8 feature dataset") +
theme(plot.title = element_text(hjust = 0.5))
Using just the first two components, none of the diseases show any separation between sick and healthy people. This clearly indicates that we cannot perform classification based on the demographics data alone.
#Elbow plot method.
library(purrr)
set.seed(226)
# function to calculate total intra-cluster sum of square
demo8_iss <- function(k) {
kmeans(demo_subset_8[,2:9],k,iter.max=100,nstart=100,algorithm="Lloyd" )$tot.withinss
}
k.values <- 1:10
demo8_iss_values <- map_dbl(k.values, demo8_iss)
plot(k.values, demo8_iss_values,
type="b", pch = 19, frame = FALSE,
xlab="Number of clusters K",
ylab="Total intra-clusters sum of squares")
From the above graph, we conclude that 6 is an appropriate number of clusters, since that is where the bend in the elbow plot appears.
Now, let us take k = 6 as our optimal number of clusters:
demo8_k6<-kmeans(demo_subset_8[,2:9],6,iter.max=100,nstart=50,algorithm="Lloyd")
demo8_k6
# Visualizing the Clustering Results using the First Two Principal Components
pcclust=prcomp(demo_subset_8[,2:9],scale=TRUE) #principal component analysis
summary(pcclust)
pcclust$rotation[,1:2]
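A sketch of the visualization named in the comment above, plotting the six clusters on the first two principal components (assuming the factoextra package loaded earlier for the PCA plots):

```r
# Project the six k-means clusters onto the first two principal components;
# fviz_cluster computes the PCA projection internally from the raw features.
fviz_cluster(demo8_k6, data = demo_subset_8[, 2:9],
             geom = "point", ellipse.type = "convex",
             main = "K-means clusters (k = 6) on the first two PCs")
```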
set.seed(100)
ggplot(demo_subset_8, aes(x =Gender, y = Age)) +
geom_point(stat = "identity", aes(color = as.factor(demo8_k6$cluster))) +
scale_color_discrete(name=" ",
breaks=c("1", "2", "3", "4", "5","6"),
labels=c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5","Cluster 6")) +
ggtitle("Demographics Data ", subtitle = "Using K-means Clustering")
From the above visualization, we observe that across the clusters both males and females span almost the same age range.
In this section, we will combine all the datasets and perform clustering algorithms.
Find associations with diseases and diet/demographics data.
Association mining is often used for market basket analysis. However, for the NHANES healthcare dataset we will explore the associations within the data and attempt to provide value for the pharmaceutical company's marketing problems: advertising their drugs and attracting individuals to clinical trials.
Our first task is to prepare the data for the association mining algorithms.
# Take interesting attributes from "data_selected" data
# "data_selected" is a subset of all the dataset combined.
association_dataset <- data_selected
association_target_dataset <- target_disease_dataset[ -c(2,3,4)]
# Merge our association data with the target dataset
# Target dataset contains all diseases (diabetes, cancer, hypertension)
association_dataset <- merge(association_dataset, association_target_dataset,by="ID")
The association rules will reference attribute values, so a value such as "Yes" on its own would be ambiguous, whereas a value such as "US Citizen" is precise. Below are a couple of examples where we have re-coded the values of attributes:
association_dataset <- association_dataset %>%
mutate(Milk_30 = recode(Milk_30 ,
"Never" = "Does not drink milk",
"Often-once a day or more?" = "Drinks milk multiple times a day",
"Rarely-less than once a week " = "Drinks milk once a week",
"Refused" = "Might be a milk drinker",
"Sometimes-once a week or more, but less than once a day" = "Drinks milk multiple times a week",
"Varied" = "Might be a milk drinker"
) )
association_dataset <- association_dataset %>%
mutate(Food_assistance = recode(Food_assistance ,
"No" = "Has not requested emergency food assistance",
"Refused" = "Unknown if emergency food assistance was requested",
"Yes" = "Has requested emergency food assistance"
) )
association_dataset <- association_dataset %>%
mutate(Insurance_current = recode(Insurance_current ,
"No" = "No health insurance coverage",
"Refused" = "Health insurance coverage unknown",
"Unknown" = "Health insurance coverage unknown",
"Yes" = "Has health insurance coverage"
) )
association_dataset <- association_dataset %>%
mutate(HAS_CANCER = recode(HAS_CANCER ,
"YES" = "HAS CANCER",
"NO" = "NO CANCER"
) )
The above recoding was performed for 18 attributes selected from the association dataset. We focused on attributes with categorical values, because for association mining numerical values add little value unless they are binned into categories. For now we have restricted ourselves to 18 attributes available in the cleaned dataset; since the dataset is rich in attributes, more could be added to the association mining algorithms in the future if the business finds value in the suggestions of this type of analysis.
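As an illustration of the binning mentioned above, a numeric attribute could be converted into labelled categories before mining; the column name and break points below are hypothetical, not taken from the cleaned dataset:

```r
library(dplyr)

# Hypothetical numeric attribute binned into mining-friendly categories with cut().
binning_example <- data.frame(Doctor_visits = c(0, 2, 5, 12))
binning_example <- binning_example %>%
  mutate(Visits_binned = cut(Doctor_visits,
                             breaks = c(-Inf, 0, 3, 8, Inf),
                             labels = c("No doctor visits", "Few doctor visits",
                                        "Several doctor visits", "Frequent doctor visits")))
```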
In order to apply association algorithms, the dataset has to be transformed into a transactional dataset. First, we need to merge all the categorical values required for mining into a single description attribute:
# Select columns required for mining analysis
association_test_columns <- c("ID", "Gender", "Race", "Country_of_birth", "Citizenship_status",
'Marital_status', "Family_income", "Dominant_hand_exam", "Milk_30", "Food_assistance",
"Insurance_current", "Health_institution", "Gaming_hours", "Smoking_relatives", "Ride_motor_vehicle",
'HAS_DIABETES', 'HAS_CANCER', 'HAS_HYPERTENSION')
subset_association = subset(association_dataset, select=association_test_columns)
# Add a description attribute with all the attributes required for unsupervised association mining analysis
subset_association <- subset_association %>%
mutate(description= paste(subset_association$Race, ",",
subset_association$Gender, ",",
subset_association$Country_of_birth, ",",
subset_association$Citizenship_status, ",",
subset_association$Marital_status, ",",
subset_association$Family_income, ",",
subset_association$Dominant_hand_exam, ",",
subset_association$Milk_30, ",",
subset_association$Food_assistance, ",",
subset_association$Insurance_current, ",",
subset_association$Health_institution, ",",
subset_association$Gaming_hours, ",",
subset_association$Smoking_relatives, ",",
subset_association$Ride_motor_vehicle, "," ,
subset_association$HAS_DIABETES, ",",
subset_association$HAS_CANCER, ",",
subset_association$HAS_HYPERTENSION
))
# Group descriptions by individual IDs and place into the transactionData dataset (ddply is from the plyr package).
library(plyr)
transactionData <- ddply(subset_association, c("ID"),
function(subset_association)paste(subset_association$description))
# Remove the ID from the transaction dataset as it is not used.
transactionData$ID <- NULL
# Write the transaction dataset
write.csv(transactionData, "Data/Working/transactiondata.csv", quote=FALSE , row.names = FALSE)
# Read in the transaction dataset for use with the unsupervised algorithms
individuals_transaction_class <- read.transactions('Data/Working/transactiondata.csv', format = 'basket',sep=',')
Now that the data is prepared, we can apply the association algorithms.
First, we create association rules against the dataset.
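Before mining, it can be useful to inspect the transaction object (an optional check):

```r
# Reports the number of transactions, item density, the most frequent items,
# and the distribution of transaction sizes.
summary(individuals_transaction_class)
```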
# Create mining rules for all values. This will tell us which values are likely to be found together
rules_for_individuals <- apriori(individuals_transaction_class, parameter = list(supp=0.001, conf=0.85, maxlen=5))
We plot the 20 most frequent values found within the data.
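The plot can be produced with the arules helper below; topN = 20 matches the description (a sketch, since the original plotting call is not shown):

```r
# Bar chart of the 20 most frequent items across all transactions.
itemFrequencyPlot(individuals_transaction_class, topN = 20,
                  type = "absolute",
                  main = "Top 20 most frequent values")
```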
Per the above, as expected, US citizen, right-handed, and born in the US are among the most frequent values. Notice also that the values indicating the absence of disease sit at the top of the list.
Over 400,000 rules are produced for the entire dataset; let's take a glance at 5 of them below.
inspect(rules_for_individuals[1:5])
lhs rhs support confidence lift count
[1] {} => {Right-handed} 0.874363155 0.8743632 1.000000 8581
[2] {} => {Has not requested emergency food assistance} 0.889341757 0.8893418 1.000000 8728
[3] {} => {US citizen} 0.905339311 0.9053393 1.000000 8885
[4] {Mostly visits doctors office for healthcare , Does not play video games , No smokers present in house , Has not rode in a vehicle within the past 7 days , HAS DIABETES , NO CANCER , NO HYPERTENSION} => {Has not requested emergency food assistance} 0.001018953 1.0000000 1.124427 10
[5] {Mostly visits doctors office for healthcare , Plays less than an hours of video games over the past 30 days , Smokers present in house , Has not rode in a vehicle within the past 7 days , NO DIABETES , NO CANCER , HAS HYPERTENSION} => {US citizen} 0.001018953 1.0000000 1.104558 10
In the above output, we can see different association mining rules for the entire dataset. Each rule has an LHS and an RHS, which express a relation between itemsets (collections of values): the items on the LHS are associated with, and occur together with, the single item on the RHS. We will now proceed to create association rules for having and not having each particular disease (cancer, diabetes, hypertension). The RHS will be fixed to the particular health condition/disease, and we will observe what types of associations are discovered on the LHS.
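The three quality measures can be reproduced by hand from the output above, using rule [4] (count = 10, confidence = 1) together with the baseline support of its RHS item from rule [2]:

```r
n_trans  <- round(10 / 0.001018953)  # total number of transactions (~9814)
supp     <- 10 / n_trans             # rule support: count(LHS and RHS) / n
conf     <- 10 / 10                  # confidence: count(LHS and RHS) / count(LHS)
supp_rhs <- 0.8893418                # support of the RHS item alone (from rule [2])
lift     <- conf / supp_rhs          # confidence relative to the RHS baseline
stopifnot(abs(lift - 1.124427) < 1e-4)  # matches the printed lift for rule [4]
```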
In order to produce a list of association rules, we had to experiment with the "conf" (confidence) parameter. For example, for the positive cancer rules we had to lower the confidence to 0.4 before any rules were produced. For each health condition (disease) we have created 2 sets of rules: the first allows a larger number of items on the LHS (maxlen=15), whereas the second restricts the rules to a small number of items on the LHS (maxlen=3).
# Association for having cancer (large itemset allowed on LHS, maxlen=15)
has_cancer.association.rules <- apriori(individuals_transaction_class, parameter = list(supp=0.001, conf=0.4, maxlen=15), appearance=list(default="lhs", rhs="HAS CANCER"))
# Association for having cancer (small itemset allowed on LHS, maxlen=3)
has_cancer.association.rules_smallitemset <- apriori(individuals_transaction_class, parameter = list(supp=0.001, conf=0.1, maxlen=3), appearance=list(default="lhs", rhs="HAS CANCER"))
# Association for having diabetes (large itemset allowed on LHS, maxlen=15)
has_diabetes.association.rules <- apriori(individuals_transaction_class, parameter = list(supp=0.001, conf=0.7, maxlen=15), appearance=list(default="lhs", rhs="HAS DIABETES"))
# Association for having diabetes (small itemset allowed on LHS, maxlen=3)
has_diabetes.association.rules_smallitemset <- apriori(individuals_transaction_class, parameter = list(supp=0.001, conf=0.4, maxlen=3), appearance=list(default="lhs", rhs="HAS DIABETES"))
# Association for having hypertension (large itemset allowed on LHS, maxlen=15)
has_hypertension.association.rules <- apriori(individuals_transaction_class, parameter = list(supp=0.001, conf=0.8, maxlen=15), appearance=list(default="lhs", rhs="HAS HYPERTENSION"))
# Association for having hypertension (small itemset allowed on LHS, maxlen=3)
has_hypertension.association.rules_smallitemset <- apriori(individuals_transaction_class, parameter = list(supp=0.001, conf=0.4, maxlen=3), appearance=list(default="lhs", rhs="HAS HYPERTENSION"))
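When a rule set is large, it can help to sort it by lift so the most interesting associations appear first. This is a sketch using standard `arules` functions and the variable names from the calls above; the output shown in this report is in the order `apriori` produced it, not sorted this way.

```r
# Sort the cancer rules by lift (strongest associations first)
# and preview only the top ten.
sorted_cancer_rules <- sort(has_cancer.association.rules,
                            by = "lift", decreasing = TRUE)
inspect(head(sorted_cancer_rules, 10))
```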
Now that we have the association rules, we can inspect them and observe which LHS items are found in conjunction with each disease.
First, we look at the large itemsets:
> inspect(has_cancer.association.rules)
lhs rhs support confidence lift count
[1] {Has health insurance coverage,
HAS HYPERTENSION,
Male,
Mostly visits outpatient departments for healthcare} => {HAS CANCER} 0.001018953 0.4347826 47.94333 10
[2] {Has health insurance coverage,
HAS HYPERTENSION,
Male,
Mostly visits outpatient departments for healthcare,
US citizen} => {HAS CANCER} 0.001018953 0.4347826 47.94333 10
[3] {{you do not/SP does not} use a computer outside of school,
Drinks milk multiple times a week,
HAS DIABETES,
Has not requested emergency food assistance,
Mostly visits a clinic or health center for healthcare} => {HAS CANCER} 0.001018953 0.4761905 52.50936 10
[4] {Born in US,
Drinks milk multiple times a day,
HAS DIABETES,
Has health insurance coverage,
No smokers present in house} => {HAS CANCER} 0.001018953 0.4000000 44.10787 10
[5] {Born in US,
HAS DIABETES,
Has health insurance coverage,
No smokers present in house,
Right-handed,
White} => {HAS CANCER} 0.001018953 0.4000000 44.10787 10
[6] {Born in US,
Drinks milk multiple times a day,
HAS DIABETES,
Has health insurance coverage,
No smokers present in house,
Right-handed} => {HAS CANCER} 0.001018953 0.4000000 44.10787 10
[7] {Born in US,
Drinks milk multiple times a day,
HAS DIABETES,
Has health insurance coverage,
No smokers present in house,
US citizen} => {HAS CANCER} 0.001018953 0.4000000 44.10787 10
[8] {Born in US,
Drinks milk multiple times a day,
Has health insurance coverage,
HAS HYPERTENSION,
Male,
No smokers present in house} => {HAS CANCER} 0.001120848 0.4583333 50.54026 11
[9] {Drinks milk multiple times a day,
Has health insurance coverage,
HAS HYPERTENSION,
Male,
No smokers present in house,
Right-handed} => {HAS CANCER} 0.001120848 0.4074074 44.92468 11
[10] {Drinks milk multiple times a day,
Has health insurance coverage,
HAS HYPERTENSION,
Male,
No smokers present in house,
US citizen} => {HAS CANCER} 0.001222743 0.4137931 45.62883 12
Next, we look at the small itemsets:
> inspect(has_cancer.association.rules_smallitemset)
lhs rhs support confidence lift count
[1] {Mostly visits outpatient departments for healthcare} => {HAS CANCER} 0.001018953 0.1010101 11.13835 10
[2] {HAS DIABETES} => {HAS CANCER} 0.003362543 0.1658291 18.28592 33
[3] {HAS HYPERTENSION} => {HAS CANCER} 0.005400448 0.1051587 11.59582 53
[4] {HAS HYPERTENSION,Mostly visits outpatient departments for healthcare} => {HAS CANCER} 0.001018953 0.2500000 27.56742 10
[5] {Male,Mostly visits outpatient departments for healthcare} => {HAS CANCER} 0.001018953 0.1666667 18.37828 10
[6] {Has health insurance coverage,Mostly visits outpatient departments for healthcare} => {HAS CANCER} 0.001018953 0.1234568 13.61354 10
[7] {Mostly visits outpatient departments for healthcare,US citizen} => {HAS CANCER} 0.001018953 0.1111111 12.25218 10
[8] {HAS DIABETES,HAS HYPERTENSION} => {HAS CANCER} 0.002343591 0.1782946 19.66048 23
[9] {Does not play video games,HAS DIABETES} => {HAS CANCER} 0.002139800 0.1794872 19.79199 21
[10] {Born outside of US,HAS DIABETES} => {HAS CANCER} 0.001018953 0.1219512 13.44752 10
[11] {Drinks milk multiple times a week,HAS DIABETES} => {HAS CANCER} 0.001324638 0.2166667 23.89176 13
[12] {HAS DIABETES,No smokers present in house} => {HAS CANCER} 0.003056858 0.2000000 22.05393 30
[13] {HAS DIABETES,Mostly visits a clinic or health center for healthcare} => {HAS CANCER} 0.002649277 0.1733333 19.11341 26
[14] {HAS DIABETES,NO HYPERTENSION} => {HAS CANCER} 0.001018953 0.1428571 15.75281 10
[15] {HAS DIABETES,Has rode in a vehicle within the past 7 days} => {HAS CANCER} 0.002649277 0.1656051 18.26122 26
[16] {HAS DIABETES,White} => {HAS CANCER} 0.001222743 0.2926829 32.27405 12
[17] {HAS DIABETES,Male} => {HAS CANCER} 0.001732219 0.1666667 18.37828 17
[18] {Female,HAS DIABETES} => {HAS CANCER} 0.001630324 0.1649485 18.18881 16
[19] {Drinks milk multiple times a day,HAS DIABETES} => {HAS CANCER} 0.001426534 0.2121212 23.39053 14
[20] {HAS DIABETES,Married} => {HAS CANCER} 0.001426534 0.1473684 16.25027 14
[21] {Born in US,HAS DIABETES} => {HAS CANCER} 0.002343591 0.1965812 21.67694 23
[22] {HAS DIABETES,Has health insurance coverage} => {HAS CANCER} 0.002751172 0.1849315 20.39233 27
[23] {HAS DIABETES,Right-handed} => {HAS CANCER} 0.003056858 0.1685393 18.58477 30
[24] {HAS DIABETES,Has not requested emergency food assistance} => {HAS CANCER} 0.002751172 0.1730769 19.08513 27
[25] {HAS DIABETES,US citizen} => {HAS CANCER} 0.002853067 0.1728395 19.05895 28
[26] {HAS HYPERTENSION,Widowed} => {HAS CANCER} 0.001120848 0.2037037 22.46234 11
[27] {No smokers present in house,Widowed} => {HAS CANCER} 0.001528429 0.1648352 18.17632 15
[28] {Has rode in a vehicle within the past 7 days,Widowed} => {HAS CANCER} 0.001018953 0.1052632 11.60733 10
[29] {HAS HYPERTENSION,Has not rode in a vehicle within the past 7 days} => {HAS CANCER} 0.001120848 0.1235955 13.62883 11
[30] {HAS HYPERTENSION,Plays less than an hours of video games over the past 30 days} => {HAS CANCER} 0.001018953 0.1020408 11.25201 10
[31] {Divorced,HAS HYPERTENSION} => {HAS CANCER} 0.001120848 0.1358025 14.97489 11
[32] {HAS HYPERTENSION,Has requested emergency food assistance} => {HAS CANCER} 0.001426534 0.1359223 14.98811 14
[33] {Does not play video games,HAS HYPERTENSION} => {HAS CANCER} 0.002445486 0.1159420 12.78489 24
[34] {Black,HAS HYPERTENSION} => {HAS CANCER} 0.001528429 0.1027397 11.32907 15
[35] {Drinks milk multiple times a week,HAS HYPERTENSION} => {HAS CANCER} 0.002037905 0.1226994 13.53002 20
[36] {HAS HYPERTENSION,No smokers present in house} => {HAS CANCER} 0.004177705 0.1198830 13.21946 41
[37] {HAS HYPERTENSION,Has rode in a vehicle within the past 7 days} => {HAS CANCER} 0.004279601 0.1012048 11.15982 42
[38] {HAS HYPERTENSION,White} => {HAS CANCER} 0.002343591 0.1586207 17.49105 23
[39] {HAS HYPERTENSION,Male} => {HAS CANCER} 0.002751172 0.1097561 12.10277 27
[40] {Female,HAS HYPERTENSION} => {HAS CANCER} 0.002649277 0.1007752 11.11245 26
[41] {Drinks milk multiple times a day,HAS HYPERTENSION} => {HAS CANCER} 0.002343591 0.1428571 15.75281 23
[42] {Born in US,HAS HYPERTENSION} => {HAS CANCER} 0.004483391 0.1282799 14.14538 44
[43] {Has health insurance coverage,HAS HYPERTENSION} => {HAS CANCER} 0.004890972 0.1333333 14.70262 48
[44] {HAS HYPERTENSION,Right-handed} => {HAS CANCER} 0.004890972 0.1083521 11.94795 48
[45] {HAS HYPERTENSION,US citizen} => {HAS CANCER} 0.005094763 0.1187648 13.09616 50
>
To build the cancer rule lists above, we had to reduce the confidence threshold to below 0.5.
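Many of the large-itemset rules that follow are near-duplicates of one another: supersets of the same core items with no improvement in confidence. The `arules` package can flag such rules with `is.redundant`, which is one way to shorten these lists. This is a sketch only; we did not apply this pruning to the output shown in this report.

```r
# Flag rules whose LHS is a superset of a simpler rule's LHS
# without any gain in confidence, then drop them.
redundant <- is.redundant(has_diabetes.association.rules)
pruned_diabetes_rules <- has_diabetes.association.rules[!redundant]
inspect(pruned_diabetes_rules)
```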
> inspect(has_diabetes.association.rules)
lhs rhs support confidence lift count
[1] {{you do not/SP does not} use a computer outside of school,
Drinks milk multiple times a week,
HAS CANCER,
Mostly visits a clinic or health center for healthcare} => {HAS DIABETES} 0.001120848 0.7857143 38.74874 11
[2] {{you do not/SP does not} use a computer outside of school,
Drinks milk multiple times a week,
HAS CANCER,
Has not requested emergency food assistance} => {HAS DIABETES} 0.001018953 0.9090909 44.83326 10
[3] {Drinks milk multiple times a week,
HAS HYPERTENSION,
Has rode in a vehicle within the past 7 days,
Household income between $20000 - $24999} => {HAS DIABETES} 0.001324638 0.7222222 35.61753 13
[4] {{you do not/SP does not} use a computer outside of school,
HAS CANCER,
HAS HYPERTENSION,
Mostly visits a clinic or health center for healthcare,
US citizen} => {HAS DIABETES} 0.001018953 0.7142857 35.22613 10
[5] {{you do not/SP does not} use a computer outside of school,
HAS CANCER,
HAS HYPERTENSION,
Has not requested emergency food assistance,
US citizen} => {HAS DIABETES} 0.001018953 0.7142857 35.22613 10
[6] {{you do not/SP does not} use a computer outside of school,
Drinks milk multiple times a week,
HAS CANCER,
Has not requested emergency food assistance,
Mostly visits a clinic or health center for healthcare} => {HAS DIABETES} 0.001018953 0.9090909 44.83326 10
[7] {{you do not/SP does not} use a computer outside of school,
HAS CANCER,
Has not requested emergency food assistance,
Mostly visits a clinic or health center for healthcare,
US citizen} => {HAS DIABETES} 0.001018953 0.7692308 37.93583 10
[8] {{you do not/SP does not} use a computer outside of school,
Drinks milk multiple times a day,
HAS HYPERTENSION,
Has not rode in a vehicle within the past 7 days,
US citizen} => {HAS DIABETES} 0.001018953 0.7142857 35.22613 10
[9] {Drinks milk multiple times a week,
HAS HYPERTENSION,
Has rode in a vehicle within the past 7 days,
Household income between $20000 - $24999,
Mostly visits a clinic or health center for healthcare} => {HAS DIABETES} 0.001120848 0.7333333 36.16549 11
[10] {Drinks milk multiple times a week,
HAS HYPERTENSION,
Household income between $20000 - $24999,
Mostly visits a clinic or health center for healthcare,
Right-handed} => {HAS DIABETES} 0.001222743 0.7058824 34.81171 12
[11] {Drinks milk multiple times a week,
HAS HYPERTENSION,
Has rode in a vehicle within the past 7 days,
Household income between $20000 - $24999,
Right-handed} => {HAS DIABETES} 0.001324638 0.7647059 37.71268 13
[12] {Drinks milk multiple times a week,
HAS HYPERTENSION,
Has not requested emergency food assistance,
Has rode in a vehicle within the past 7 days,
Household income between $20000 - $24999} => {HAS DIABETES} 0.001018953 0.7142857 35.22613 10
[13] {Drinks milk multiple times a week,
HAS HYPERTENSION,
Has rode in a vehicle within the past 7 days,
Household income between $20000 - $24999,
US citizen} => {HAS DIABETES} 0.001120848 0.7857143 38.74874 11
[14] {Drinks milk multiple times a week,
Has health insurance coverage,
HAS HYPERTENSION,
Has rode in a vehicle within the past 7 days,
Mexican_American} => {HAS DIABETES} 0.001222743 0.7058824 34.81171 12
[15] {Drinks milk multiple times a week,
Has health insurance coverage,
HAS HYPERTENSION,
Mexican_American,
US citizen} => {HAS DIABETES} 0.001018953 0.7142857 35.22613 10
[16] {Drinks milk multiple times a week,
HAS HYPERTENSION,
Has not requested emergency food assistance,
Mexican_American,
US citizen} => {HAS DIABETES} 0.001018953 0.7142857 35.22613 10
[17] {Born in US,
Has health insurance coverage,
HAS HYPERTENSION,
Mexican_American,
No smokers present in house} => {HAS DIABETES} 0.001222743 0.7058824 34.81171 12
[18] {Born in US,
Drinks milk multiple times a day,
Has health insurance coverage,
HAS HYPERTENSION,
Right-handed,
Widowed} => {HAS DIABETES} 0.001018953 0.7142857 35.22613 10
[19] {Drinks milk multiple times a week,
HAS HYPERTENSION,
Has rode in a vehicle within the past 7 days,
Household income between $20000 - $24999,
Mostly visits a clinic or health center for healthcare,
Right-handed} => {HAS DIABETES} 0.001120848 0.7857143 38.74874 11
[20] {Drinks milk multiple times a week,
HAS HYPERTENSION,
Has not requested emergency food assistance,
Household income between $20000 - $24999,
Mostly visits a clinic or health center for healthcare,
Right-handed} => {HAS DIABETES} 0.001018953 0.7142857 35.22613 10
[21] {Drinks milk multiple times a week,
HAS HYPERTENSION,
Household income between $20000 - $24999,
Mostly visits a clinic or health center for healthcare,
Right-handed,
US citizen} => {HAS DIABETES} 0.001018953 0.7142857 35.22613 10
[22] {Drinks milk multiple times a week,
HAS HYPERTENSION,
Has rode in a vehicle within the past 7 days,
Household income between $20000 - $24999,
NO CANCER,
Right-handed} => {HAS DIABETES} 0.001120848 0.7333333 36.16549 11
[23] {Drinks milk multiple times a week,
HAS HYPERTENSION,
Has not requested emergency food assistance,
Has rode in a vehicle within the past 7 days,
Household income between $20000 - $24999,
Right-handed} => {HAS DIABETES} 0.001018953 0.7692308 37.93583 10
[24] {Drinks milk multiple times a week,
HAS HYPERTENSION,
Has rode in a vehicle within the past 7 days,
Household income between $20000 - $24999,
Right-handed,
US citizen} => {HAS DIABETES} 0.001120848 0.7857143 38.74874 11
[25] {Has health insurance coverage,
HAS HYPERTENSION,
Household income between $20000 - $24999,
Male,
No smokers present in house,
Right-handed} => {HAS DIABETES} 0.001018953 0.7142857 35.22613 10
[26] {Drinks milk multiple times a week,
Has health insurance coverage,
HAS HYPERTENSION,
Has not requested emergency food assistance,
Has rode in a vehicle within the past 7 days,
Mexican_American} => {HAS DIABETES} 0.001120848 0.7333333 36.16549 11
[27] {Drinks milk multiple times a week,
Has health insurance coverage,
HAS HYPERTENSION,
Has rode in a vehicle within the past 7 days,
Mexican_American,
US citizen} => {HAS DIABETES} 0.001018953 0.7692308 37.93583 10
[28] {Drinks milk multiple times a week,
HAS HYPERTENSION,
Has not requested emergency food assistance,
Has rode in a vehicle within the past 7 days,
Mexican_American,
US citizen} => {HAS DIABETES} 0.001018953 0.7142857 35.22613 10
[29] {Born in US,
Has health insurance coverage,
HAS HYPERTENSION,
Has rode in a vehicle within the past 7 days,
Mexican_American,
No smokers present in house} => {HAS DIABETES} 0.001222743 0.7500000 36.98744 12
[30] {Born in US,
Has health insurance coverage,
HAS HYPERTENSION,
Mexican_American,
No smokers present in house,
Right-handed} => {HAS DIABETES} 0.001222743 0.7058824 34.81171 12
[31] {Born in US,
Has health insurance coverage,
HAS HYPERTENSION,
Has not requested emergency food assistance,
Mexican_American,
No smokers present in house} => {HAS DIABETES} 0.001018953 0.7142857 35.22613 10
[32] {Born in US,
Has health insurance coverage,
HAS HYPERTENSION,
Mexican_American,
No smokers present in house,
US citizen} => {HAS DIABETES} 0.001222743 0.7058824 34.81171 12
>
Next, we look at the small itemsets:
> inspect(has_diabetes.association.rules_smallitemset)
lhs rhs support confidence lift count
[1] {HAS CANCER,HAS HYPERTENSION} => {HAS DIABETES} 0.002343591 0.4339623 21.40154 23
[2] {Does not play video games,HAS CANCER} => {HAS DIABETES} 0.002139800 0.5384615 26.55508 21
[3] {Born outside of US,HAS CANCER} => {HAS DIABETES} 0.001018953 0.5882353 29.00975 10
[4] {Drinks milk multiple times a week,HAS CANCER} => {HAS DIABETES} 0.001324638 0.4062500 20.03486 13
[5] {HAS CANCER,No smokers present in house} => {HAS DIABETES} 0.003056858 0.4225352 20.83799 30
[6] {HAS CANCER,Mostly visits a clinic or health center for healthcare} => {HAS DIABETES} 0.002649277 0.4000000 19.72663 26
[7] {Drinks milk multiple times a day,HAS CANCER} => {HAS DIABETES} 0.001426534 0.4000000 19.72663 14
[8] {HAS CANCER,Has not requested emergency food assistance} => {HAS DIABETES} 0.002751172 0.4218750 20.80543 27
>
inspect(has_hypertension.association.rules[1:100])
lhs rhs support confidence lift count
[1] {HAS CANCER,
Mostly visits outpatient departments for healthcare} => {HAS HYPERTENSION} 0.001018953 1.0000000 19.47222 10
[2] {Black,
HAS CANCER} => {HAS HYPERTENSION} 0.001528429 0.8333333 16.22685 15
[3] {HAS DIABETES,
Household income between $20000 - $24999} => {HAS HYPERTENSION} 0.002649277 0.8387097 16.33154 26
[4] {Black,
HAS DIABETES} => {HAS HYPERTENSION} 0.004177705 0.8200000 15.96722 41
[5] {HAS CANCER,
Male,
Mostly visits outpatient departments for healthcare} => {HAS HYPERTENSION} 0.001018953 1.0000000 19.47222 10
[6] {HAS CANCER,
Has health insurance coverage,
Mostly visits outpatient departments for healthcare} => {HAS HYPERTENSION} 0.001018953 1.0000000 19.47222 10
[7] {HAS CANCER,
Mostly visits outpatient departments for healthcare,
US citizen} => {HAS HYPERTENSION} 0.001018953 1.0000000 19.47222 10
[8] {HAS CANCER,
Has rode in a vehicle within the past 7 days,
Plays less than an hours of video games over the past 30 days} => {HAS HYPERTENSION} 0.001018953 0.8333333 16.22685 10
[9] {{you do not/SP does not} use a computer outside of school,
Black,
HAS CANCER} => {HAS HYPERTENSION} 0.001120848 0.9166667 17.84954 11
[10] {Black,
Drinks milk multiple times a week,
HAS CANCER} => {HAS HYPERTENSION} 0.001018953 0.9090909 17.70202 10
[11] {Black,
HAS CANCER,
No smokers present in house} => {HAS HYPERTENSION} 0.001120848 0.8461538 16.47650 11
[12] {Black,
HAS CANCER,
Mostly visits a clinic or health center for healthcare} => {HAS HYPERTENSION} 0.001018953 0.8333333 16.22685 10
[13] {Black,
HAS CANCER,
Male} => {HAS HYPERTENSION} 0.001120848 0.8461538 16.47650 11
[14] {Black,
Born in US,
HAS CANCER} => {HAS HYPERTENSION} 0.001528429 0.8333333 16.22685 15
[15] {Black,
HAS CANCER,
Has health insurance coverage} => {HAS HYPERTENSION} 0.001426534 0.8235294 16.03595 14
[16] {Black,
HAS CANCER,
Right-handed} => {HAS HYPERTENSION} 0.001324638 0.8666667 16.87593 13
[17] {Black,
HAS CANCER,
Has not requested emergency food assistance} => {HAS HYPERTENSION} 0.001018953 0.8333333 16.22685 10
[18] {Black,
HAS CANCER,
US citizen} => {HAS HYPERTENSION} 0.001528429 0.8333333 16.22685 15
[19] {Drinks milk multiple times a day,
HAS CANCER,
Male} => {HAS HYPERTENSION} 0.001426534 0.8235294 16.03595 14
[20] {$5000 - $9999,
HAS DIABETES,
No smokers present in house} => {HAS HYPERTENSION} 0.001120848 0.8461538 16.47650 11
[21] {$5000 - $9999,
HAS DIABETES,
Has health insurance coverage} => {HAS HYPERTENSION} 0.001018953 0.8333333 16.22685 10
[22] {$5000 - $9999,
HAS DIABETES,
Right-handed} => {HAS HYPERTENSION} 0.001222743 0.8000000 15.57778 12
[23] {$5000 - $9999,
HAS DIABETES,
US citizen} => {HAS HYPERTENSION} 0.001018953 0.8333333 16.22685 10
[24] {HAS DIABETES,
White,
Widowed} => {HAS HYPERTENSION} 0.001120848 0.8461538 16.47650 11
[25] {Female,
HAS DIABETES,
Widowed} => {HAS HYPERTENSION} 0.001426534 0.8750000 17.03819 14
[26] {Born in US,
HAS DIABETES,
Widowed} => {HAS HYPERTENSION} 0.001426534 0.8235294 16.03595 14
[27] {HAS DIABETES,
Has health insurance coverage,
Widowed} => {HAS HYPERTENSION} 0.002037905 0.8000000 15.57778 20
[28] {HAS DIABETES,
US citizen,
Widowed} => {HAS HYPERTENSION} 0.002037905 0.8000000 15.57778 20
[29] {Female,
HAS DIABETES,
Has not rode in a vehicle within the past 7 days} => {HAS HYPERTENSION} 0.002139800 0.8076923 15.72756 21
[30] {Born in US,
HAS DIABETES,
Has not rode in a vehicle within the past 7 days} => {HAS HYPERTENSION} 0.001324638 0.8125000 15.82118 13
[31] {$10000 - $14999,
{you do not/SP does not} use a computer outside of school,
HAS DIABETES} => {HAS HYPERTENSION} 0.001222743 0.8000000 15.57778 12
[32] {$10000 - $14999,
Born in US,
HAS DIABETES} => {HAS HYPERTENSION} 0.001018953 0.8333333 16.22685 10
[33] {HAS DIABETES,
Has not requested emergency food assistance,
Plays less than an hours of video games over the past 30 days} => {HAS HYPERTENSION} 0.001630324 0.8000000 15.57778 16
[34] {Black,
HAS DIABETES,
Smokers present in house} => {HAS HYPERTENSION} 0.001426534 0.8235294 16.03595 14
[35] {Black,
HAS DIABETES,
Household income between $20000 - $24999} => {HAS HYPERTENSION} 0.001120848 1.0000000 19.47222 11
[36] {Drinks milk multiple times a week,
HAS DIABETES,
Household income between $20000 - $24999} => {HAS HYPERTENSION} 0.001426534 0.8750000 17.03819 14
[37] {HAS DIABETES,
Household income between $20000 - $24999,
Mostly visits a clinic or health center for healthcare} => {HAS HYPERTENSION} 0.002343591 0.8214286 15.99504 23
[38] {HAS DIABETES,
Has rode in a vehicle within the past 7 days,
Household income between $20000 - $24999} => {HAS HYPERTENSION} 0.002139800 0.8750000 17.03819 21
[39] {HAS DIABETES,
Household income between $20000 - $24999,
NO CANCER} => {HAS HYPERTENSION} 0.002139800 0.8750000 17.03819 21
[40] {Female,
HAS DIABETES,
Household income between $20000 - $24999} => {HAS HYPERTENSION} 0.001324638 0.9285714 18.08135 13
[41] {Born in US,
HAS DIABETES,
Household income between $20000 - $24999} => {HAS HYPERTENSION} 0.002037905 0.9090909 17.70202 20
[42] {HAS DIABETES,
Has health insurance coverage,
Household income between $20000 - $24999} => {HAS HYPERTENSION} 0.002139800 0.8400000 16.35667 21
[43] {HAS DIABETES,
Household income between $20000 - $24999,
Right-handed} => {HAS HYPERTENSION} 0.002343591 0.8214286 15.99504 23
[44] {HAS DIABETES,
Has not requested emergency food assistance,
Household income between $20000 - $24999} => {HAS HYPERTENSION} 0.001834115 0.8181818 15.93182 18
[45] {HAS DIABETES,
Household income between $20000 - $24999,
US citizen} => {HAS HYPERTENSION} 0.002445486 0.8888889 17.30864 24
[46] {Divorced,
Female,
HAS DIABETES} => {HAS HYPERTENSION} 0.001324638 0.8125000 15.82118 13
[47] {Divorced,
HAS DIABETES,
Has health insurance coverage} => {HAS HYPERTENSION} 0.001630324 0.8000000 15.57778 16
[48] {{you do not/SP does not} use a computer outside of school,
HAS DIABETES,
Has requested emergency food assistance} => {HAS HYPERTENSION} 0.002343591 0.8518519 16.58745 23
[49] {Black,
HAS DIABETES,
Has requested emergency food assistance} => {HAS HYPERTENSION} 0.001630324 0.8888889 17.30864 16
[50] {{you do not/SP does not} use a computer outside of school,
Black,
HAS DIABETES} => {HAS HYPERTENSION} 0.002751172 0.9642857 18.77679 27
[51] {Born in US,
HAS DIABETES,
Mexican_American} => {HAS HYPERTENSION} 0.001630324 0.8000000 15.57778 16
[52] {Black,
Drinks milk multiple times a week,
HAS DIABETES} => {HAS HYPERTENSION} 0.001732219 0.8947368 17.42251 17
[53] {Black,
HAS DIABETES,
No smokers present in house} => {HAS HYPERTENSION} 0.002649277 0.8125000 15.82118 26
[54] {Black,
HAS DIABETES,
Mostly visits a clinic or health center for healthcare} => {HAS HYPERTENSION} 0.003566334 0.8536585 16.62263 35
[55] {Black,
HAS DIABETES,
Has rode in a vehicle within the past 7 days} => {HAS HYPERTENSION} 0.003362543 0.8250000 16.06458 33
[56] {Black,
HAS DIABETES,
NO CANCER} => {HAS HYPERTENSION} 0.003464439 0.8095238 15.76323 34
[57] {Black,
Female,
HAS DIABETES} => {HAS HYPERTENSION} 0.001936010 0.8636364 16.81692 19
[58] {Black,
Drinks milk multiple times a day,
HAS DIABETES} => {HAS HYPERTENSION} 0.001018953 0.8333333 16.22685 10
[59] {Black,
HAS DIABETES,
Married} => {HAS HYPERTENSION} 0.001834115 0.8181818 15.93182 18
[60] {Black,
Born in US,
HAS DIABETES} => {HAS HYPERTENSION} 0.004075810 0.8695652 16.93237 40
[61] {Black,
HAS DIABETES,
Has health insurance coverage} => {HAS HYPERTENSION} 0.003056858 0.8571429 16.69048 30
[62] {Black,
HAS DIABETES,
US citizen} => {HAS HYPERTENSION} 0.004177705 0.8200000 15.96722 41
[63] {Born in US,
Drinks milk multiple times a week,
HAS DIABETES} => {HAS HYPERTENSION} 0.003056858 0.8108108 15.78829 30
[64] {Drinks milk multiple times a week,
HAS DIABETES,
US citizen} => {HAS HYPERTENSION} 0.003770124 0.8043478 15.66244 37
[65] {HAS CANCER,
Has health insurance coverage,
Male,
Mostly visits outpatient departments for healthcare} => {HAS HYPERTENSION} 0.001018953 1.0000000 19.47222 10
[66] {HAS CANCER,
Male,
Mostly visits outpatient departments for healthcare,
US citizen} => {HAS HYPERTENSION} 0.001018953 1.0000000 19.47222 10
[67] {HAS CANCER,
Has health insurance coverage,
Mostly visits outpatient departments for healthcare,
US citizen} => {HAS HYPERTENSION} 0.001018953 1.0000000 19.47222 10
[68] {HAS CANCER,
Has health insurance coverage,
Has rode in a vehicle within the past 7 days,
Smokers present in house} => {HAS HYPERTENSION} 0.001018953 0.8333333 16.22685 10
[69] {{you do not/SP does not} use a computer outside of school,
Black,
Born in US,
HAS CANCER} => {HAS HYPERTENSION} 0.001120848 0.9166667 17.84954 11
[70] {{you do not/SP does not} use a computer outside of school,
Black,
HAS CANCER,
Has health insurance coverage} => {HAS HYPERTENSION} 0.001018953 0.9090909 17.70202 10
[71] {{you do not/SP does not} use a computer outside of school,
Black,
HAS CANCER,
US citizen} => {HAS HYPERTENSION} 0.001120848 0.9166667 17.84954 11
[72] {Black,
Born in US,
Drinks milk multiple times a week,
HAS CANCER} => {HAS HYPERTENSION} 0.001018953 0.9090909 17.70202 10
[73] {Black,
Drinks milk multiple times a week,
HAS CANCER,
US citizen} => {HAS HYPERTENSION} 0.001018953 0.9090909 17.70202 10
[74] {Black,
Born in US,
HAS CANCER,
No smokers present in house} => {HAS HYPERTENSION} 0.001120848 0.8461538 16.47650 11
[75] {Black,
HAS CANCER,
Has health insurance coverage,
No smokers present in house} => {HAS HYPERTENSION} 0.001120848 0.8461538 16.47650 11
[76] {Black,
HAS CANCER,
No smokers present in house,
Right-handed} => {HAS HYPERTENSION} 0.001018953 0.9090909 17.70202 10
[77] {Black,
HAS CANCER,
No smokers present in house,
US citizen} => {HAS HYPERTENSION} 0.001120848 0.8461538 16.47650 11
[78] {Black,
Born in US,
HAS CANCER,
Mostly visits a clinic or health center for healthcare} => {HAS HYPERTENSION} 0.001018953 0.8333333 16.22685 10
[79] {Black,
HAS CANCER,
Mostly visits a clinic or health center for healthcare,
US citizen} => {HAS HYPERTENSION} 0.001018953 0.8333333 16.22685 10
[80] {Black,
Born in US,
HAS CANCER,
Male} => {HAS HYPERTENSION} 0.001120848 0.8461538 16.47650 11
[81] {Black,
HAS CANCER,
Has health insurance coverage,
Male} => {HAS HYPERTENSION} 0.001120848 0.8461538 16.47650 11
[82] {Black,
HAS CANCER,
Male,
US citizen} => {HAS HYPERTENSION} 0.001120848 0.8461538 16.47650 11
[83] {Black,
Born in US,
HAS CANCER,
Has health insurance coverage} => {HAS HYPERTENSION} 0.001426534 0.8235294 16.03595 14
[84] {Black,
Born in US,
HAS CANCER,
Right-handed} => {HAS HYPERTENSION} 0.001324638 0.8666667 16.87593 13
[85] {Black,
Born in US,
HAS CANCER,
Has not requested emergency food assistance} => {HAS HYPERTENSION} 0.001018953 0.8333333 16.22685 10
[86] {Black,
Born in US,
HAS CANCER,
US citizen} => {HAS HYPERTENSION} 0.001528429 0.8333333 16.22685 15
[87] {Black,
HAS CANCER,
Has health insurance coverage,
Right-handed} => {HAS HYPERTENSION} 0.001222743 0.8571429 16.69048 12
[88] {Black,
HAS CANCER,
Has health insurance coverage,
US citizen} => {HAS HYPERTENSION} 0.001426534 0.8235294 16.03595 14
[89] {Black,
HAS CANCER,
Right-handed,
US citizen} => {HAS HYPERTENSION} 0.001324638 0.8666667 16.87593 13
[90] {Black,
HAS CANCER,
Has not requested emergency food assistance,
US citizen} => {HAS HYPERTENSION} 0.001018953 0.8333333 16.22685 10
[91] {Drinks milk multiple times a day,
HAS CANCER,
Male,
No smokers present in house} => {HAS HYPERTENSION} 0.001222743 0.8000000 15.57778 12
[92] {Drinks milk multiple times a day,
HAS CANCER,
Has rode in a vehicle within the past 7 days,
Male} => {HAS HYPERTENSION} 0.001120848 0.8461538 16.47650 11
[93] {Born in US,
Drinks milk multiple times a day,
HAS CANCER,
Male} => {HAS HYPERTENSION} 0.001324638 0.8125000 15.82118 13
[94] {Drinks milk multiple times a day,
HAS CANCER,
Has health insurance coverage,
Male} => {HAS HYPERTENSION} 0.001426534 0.8235294 16.03595 14
[95] {Drinks milk multiple times a day,
HAS CANCER,
Male,
Right-handed} => {HAS HYPERTENSION} 0.001324638 0.8125000 15.82118 13
[96] {Drinks milk multiple times a day,
HAS CANCER,
Male,
US citizen} => {HAS HYPERTENSION} 0.001426534 0.8235294 16.03595 14
[97] {Born in US,
HAS DIABETES,
Male,
Mostly visits outpatient departments for healthcare} => {HAS HYPERTENSION} 0.001018953 0.8333333 16.22685 10
[98] {$5000 - $9999,
HAS DIABETES,
NO CANCER,
No smokers present in house} => {HAS HYPERTENSION} 0.001018953 0.8333333 16.22685 10
[99] {$5000 - $9999,
HAS DIABETES,
No smokers present in house,
Right-handed} => {HAS HYPERTENSION} 0.001120848 0.9166667 17.84954 11
[100] {$5000 - $9999,
HAS DIABETES,
Has health insurance coverage,
Right-handed} => {HAS HYPERTENSION} 0.001018953 0.8333333 16.22685 10
>
> inspect(has_hypertension.association.rules_smallitemset)
lhs rhs support confidence lift count
[1] {HAS CANCER} => {HAS HYPERTENSION} 0.005400448 0.5955056 11.595818 53
[2] {Mostly visits outpatient departments for healthcare} => {HAS HYPERTENSION} 0.004075810 0.4040404 7.867565 40
[3] {HAS DIABETES} => {HAS HYPERTENSION} 0.013144487 0.6482412 12.622697 129
[4] {HAS CANCER,
Mostly visits outpatient departments for healthcare} => {HAS HYPERTENSION} 0.001018953 1.0000000 19.472222 10
[5] {HAS CANCER,
HAS DIABETES} => {HAS HYPERTENSION} 0.002343591 0.6969697 13.571549 23
[6] {HAS CANCER,
Widowed} => {HAS HYPERTENSION} 0.001120848 0.6875000 13.387153 11
[7] {HAS CANCER,
Has not rode in a vehicle within the past 7 days} => {HAS HYPERTENSION} 0.001120848 0.5789474 11.273392 11
[8] {HAS CANCER,
Plays less than an hours of video games over the past 30 days} => {HAS HYPERTENSION} 0.001018953 0.7142857 13.908730 10
[9] {HAS CANCER,
Smokers present in house} => {HAS HYPERTENSION} 0.001222743 0.6666667 12.981481 12
[10] {Divorced,
HAS CANCER} => {HAS HYPERTENSION} 0.001120848 0.7857143 15.299603 11
[11] {HAS CANCER,
Has requested emergency food assistance} => {HAS HYPERTENSION} 0.001426534 0.5600000 10.904444 14
[12] {Does not play video games,
HAS CANCER} => {HAS HYPERTENSION} 0.002445486 0.6153846 11.982906 24
[13] {Black,
HAS CANCER} => {HAS HYPERTENSION} 0.001528429 0.8333333 16.226852 15
[14] {Drinks milk multiple times a week,
HAS CANCER} => {HAS HYPERTENSION} 0.002037905 0.6250000 12.170139 20
[15] {HAS CANCER,
No smokers present in house} => {HAS HYPERTENSION} 0.004177705 0.5774648 11.244523 41
[16] {HAS CANCER,
Mostly visits a clinic or health center for healthcare} => {HAS HYPERTENSION} 0.003872020 0.5846154 11.383761 38
[17] {HAS CANCER,
Has rode in a vehicle within the past 7 days} => {HAS HYPERTENSION} 0.004279601 0.6000000 11.683333 42
[18] {HAS CANCER,
NO DIABETES} => {HAS HYPERTENSION} 0.003056858 0.5357143 10.431548 30
[19] {HAS CANCER,
White} => {HAS HYPERTENSION} 0.002343591 0.5609756 10.923442 23
[20] {HAS CANCER,
Male} => {HAS HYPERTENSION} 0.002751172 0.6279070 12.226744 27
[21] {Female,
HAS CANCER} => {HAS HYPERTENSION} 0.002649277 0.5652174 11.006039 26
[22] {Drinks milk multiple times a day,
HAS CANCER} => {HAS HYPERTENSION} 0.002343591 0.6571429 12.796032 23
[23] {HAS CANCER,
Married} => {HAS HYPERTENSION} 0.002037905 0.5128205 9.985755 20
[24] {Born in US,
HAS CANCER} => {HAS HYPERTENSION} 0.004483391 0.6111111 11.899691 44
[25] {HAS CANCER,
Has health insurance coverage} => {HAS HYPERTENSION} 0.004890972 0.6153846 11.982906 48
[26] {HAS CANCER,
Right-handed} => {HAS HYPERTENSION} 0.004890972 0.6075949 11.831224 48
[27] {HAS CANCER,
Has not requested emergency food assistance} => {HAS HYPERTENSION} 0.003973915 0.6093750 11.865885 39
[28] {HAS CANCER,
US citizen} => {HAS HYPERTENSION} 0.005094763 0.6250000 12.170139 50
[29] {No health insurance coverage,
Plays 4 hours of video games over the past 30 days} => {HAS HYPERTENSION} 0.001222743 0.4285714 8.345238 12
[30] {Has rode in a vehicle within the past 7 days,
Under $20000} => {HAS HYPERTENSION} 0.001018953 0.4166667 8.113426 10
[31] {HAS DIABETES,
Mostly visits outpatient departments for healthcare} => {HAS HYPERTENSION} 0.001324638 0.6842105 13.323099 13
[32] {Does not play video games,
Mostly visits outpatient departments for healthcare} => {HAS HYPERTENSION} 0.001732219 0.4473684 8.711257 17
[33] {Black,
Mostly visits outpatient departments for healthcare} => {HAS HYPERTENSION} 0.001936010 0.5277778 10.277006 19
[34] {Drinks milk multiple times a week,
Mostly visits outpatient departments for healthcare} => {HAS HYPERTENSION} 0.001630324 0.6400000 12.462222 16
[35] {Mostly visits outpatient departments for healthcare,
No smokers present in house} => {HAS HYPERTENSION} 0.002853067 0.4000000 7.788889 28
[36] {Has rode in a vehicle within the past 7 days,
Mostly visits outpatient departments for healthcare} => {HAS HYPERTENSION} 0.003566334 0.4117647 8.017974 35
[37] {Mostly visits outpatient departments for healthcare,
White} => {HAS HYPERTENSION} 0.001222743 0.5000000 9.736111 12
[38] {Male,
Mostly visits outpatient departments for healthcare} => {HAS HYPERTENSION} 0.002853067 0.4666667 9.087037 28
[39] {Born in US,
Mostly visits outpatient departments for healthcare} => {HAS HYPERTENSION} 0.003260648 0.4324324 8.420420 32
[40] {Has health insurance coverage,
Mostly visits outpatient departments for healthcare} => {HAS HYPERTENSION} 0.003464439 0.4197531 8.173525 34
[41] {Mostly visits outpatient departments for healthcare,
Right-handed} => {HAS HYPERTENSION} 0.003362543 0.4074074 7.933128 33
[42] {Mostly visits outpatient departments for healthcare,
US citizen} => {HAS HYPERTENSION} 0.003872020 0.4222222 8.221605 38
[43] {Plays 3 hours of video games over the past 30 days,
Rarely-less than once a week} => {HAS HYPERTENSION} 0.001120848 0.4400000 8.567778 11
[44] {HAS DIABETES,
Mostly visits an emergency room for healthcare} => {HAS HYPERTENSION} 0.001120848 0.5500000 10.709722 11
[45] {$5000 - $9999,
HAS DIABETES} => {HAS HYPERTENSION} 0.001222743 0.7500000 14.604167 12
[46] {HAS DIABETES,
Widowed} => {HAS HYPERTENSION} 0.002037905 0.7692308 14.978632 20
[47] {HAS DIABETES,
Plays 1 hour of video games over the past 30 days} => {HAS HYPERTENSION} 0.001120848 0.6875000 13.387153 11
[48] {HAS DIABETES,
Has not rode in a vehicle within the past 7 days} => {HAS HYPERTENSION} 0.002751172 0.6428571 12.517857 27
[49] {$10000 - $14999,
HAS DIABETES} => {HAS HYPERTENSION} 0.001426534 0.7368421 14.347953 14
[50] {$15000 - $19999,
HAS DIABETES} => {HAS HYPERTENSION} 0.001426534 0.7000000 13.630556 14
[51] {HAS DIABETES,
Household income between $45000 - $54999} => {HAS HYPERTENSION} 0.001018953 0.7692308 14.978632 10
[52] {HAS DIABETES,
Plays less than an hours of video games over the past 30 days} => {HAS HYPERTENSION} 0.001834115 0.7200000 14.020000 18
[53] {HAS DIABETES,
Smokers present in house} => {HAS HYPERTENSION} 0.003260648 0.6666667 12.981481 32
[54] {HAS DIABETES,
Household income between $35000 - $44999} => {HAS HYPERTENSION} 0.001018953 0.7142857 13.908730 10
[55] {HAS DIABETES,
Household income between $20000 - $24999} => {HAS HYPERTENSION} 0.002649277 0.8387097 16.331541 26
[56] {HAS DIABETES,
Other_Hispanic} => {HAS HYPERTENSION} 0.001630324 0.5333333 10.385185 16
[57] {HAS DIABETES,
Other} => {HAS HYPERTENSION} 0.001630324 0.4444444 8.654321 16
[58] {Divorced,
HAS DIABETES} => {HAS HYPERTENSION} 0.002343591 0.6764706 13.172386 23
[59] {HAS DIABETES,
Has requested emergency food assistance} => {HAS HYPERTENSION} 0.003158753 0.7209302 14.038114 31
[60] {Does not play video games,
HAS DIABETES} => {HAS HYPERTENSION} 0.007845934 0.6581197 12.815052 77
[61] {HAS DIABETES,
Rarely-less than once a week} => {HAS HYPERTENSION} 0.002241696 0.6111111 11.899691 22
[62] {Does not drink milk,
HAS DIABETES} => {HAS HYPERTENSION} 0.002241696 0.5945946 11.578078 22
[63] {HAS DIABETES,
Never_married} => {HAS HYPERTENSION} 0.001528429 0.6250000 12.170139 15
[64] {HAS DIABETES,
No health insurance coverage} => {HAS HYPERTENSION} 0.002649277 0.4905660 9.552411 26
[65] {HAS DIABETES,
Mexican_American} => {HAS HYPERTENSION} 0.003566334 0.6140351 11.956628 35
[66] {Born outside of US,
HAS DIABETES} => {HAS HYPERTENSION} 0.004381496 0.5243902 10.211043 43
[67] {Black,
HAS DIABETES} => {HAS HYPERTENSION} 0.004177705 0.8200000 15.967222 41
[68] {Drinks milk multiple times a week,
HAS DIABETES} => {HAS HYPERTENSION} 0.004483391 0.7333333 14.279630 44
[69] {HAS DIABETES,
No smokers present in house} => {HAS HYPERTENSION} 0.009781944 0.6400000 12.462222 96
[70] {HAS DIABETES,
Mostly visits a clinic or health center for healthcare} => {HAS HYPERTENSION} 0.010189525 0.6666667 12.981481 100
[71] {HAS DIABETES,
Has rode in a vehicle within the past 7 days} => {HAS HYPERTENSION} 0.010393316 0.6496815 12.650743 102
[72] {HAS DIABETES,
NO CANCER} => {HAS HYPERTENSION} 0.010800897 0.6385542 12.434070 106
[73] {HAS DIABETES,
White} => {HAS HYPERTENSION} 0.002649277 0.6341463 12.348238 26
[74] {HAS DIABETES,
Male} => {HAS HYPERTENSION} 0.006215610 0.5980392 11.645153 61
[75] {Female,
HAS DIABETES} => {HAS HYPERTENSION} 0.006928877 0.7010309 13.650630 68
[76] {Drinks milk multiple times a day,
HAS DIABETES} => {HAS HYPERTENSION} 0.004177705 0.6212121 12.096380 41
[77] {HAS DIABETES,
Married} => {HAS HYPERTENSION} 0.005706134 0.5894737 11.478363 56
[78] {Born in US,
HAS DIABETES} => {HAS HYPERTENSION} 0.008762992 0.7350427 14.312915 86
[79] {HAS DIABETES,
Has health insurance coverage} => {HAS HYPERTENSION} 0.010495211 0.7054795 13.737253 103
[80] {HAS DIABETES,
Right-handed} => {HAS HYPERTENSION} 0.011717954 0.6460674 12.580368 115
[81] {HAS DIABETES,
Has not requested emergency food assistance} => {HAS HYPERTENSION} 0.009985735 0.6282051 12.232550 98
[82] {HAS DIABETES,
US citizen} => {HAS HYPERTENSION} 0.011514163 0.6975309 13.582476 113
[83] {Has not rode in a vehicle within the past 7 days,
Widowed} => {HAS HYPERTENSION} 0.001120848 0.4400000 8.567778 11
[84] {Plays less than an hours of video games over the past 30 days,
Widowed} => {HAS HYPERTENSION} 0.001426534 0.4827586 9.400383 14
[85] {Does not play video games,
Widowed} => {HAS HYPERTENSION} 0.002954962 0.5471698 10.654612 29
[86] {No smokers present in house,
Widowed} => {HAS HYPERTENSION} 0.004585286 0.4945055 9.629121 45
[87] {Mostly visits a clinic or health center for healthcare,
Widowed} => {HAS HYPERTENSION} 0.004075810 0.4597701 8.952746 40
[88] {Has rode in a vehicle within the past 7 days,
Widowed} => {HAS HYPERTENSION} 0.004381496 0.4526316 8.813743 43
[89] {NO CANCER,
Widowed} => {HAS HYPERTENSION} 0.004381496 0.4134615 8.051015 43
[Interactive itemset graphs were generated per disease — CANCER, DIABETES, and HYPERTENSION, each with large and small itemsets — and saved to HTML, e.g. “hypertension_graph_large.html”.]
Using top 20 rules for cancer
Using top 20 rules for diabetes
Using top 20 rules for hypertension
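The rule listings above can be reproduced with arules. A minimal sketch, assuming `survey_trans` is a transactions object built from the discretized survey attributes (the object name and the exact support/confidence values are our assumptions, chosen to match the magnitudes in the listings):

```r
library(arules)

# Mine rules with hypertension on the right-hand side, then keep the
# 20 strongest by lift (matching the "top 20 rules" listings above)
rules <- apriori(survey_trans,
                 parameter  = list(supp = 0.001, conf = 0.4),
                 appearance = list(default = "lhs", rhs = "HAS HYPERTENSION"))
top20 <- head(sort(rules, by = "lift"), 20)
inspect(top20)
```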
In the preceding section, we looked at associations for having diseases. In this section, we look at the inverse: could there be value in associations where the disease/health condition is absent?
Target attributes need to be added.
As part of the business problem, we are focusing on three targets (diabetes, hypertension, cancer):
#DIQ010 - Doctor told you have diabetes
#https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/DIQ_H.htm
#The next questions are about specific medical conditions. {Other than during pregnancy, {have you/has SP}/{Have you/Has SP}} ever been told by a doctor or health professional that {you have/{he/she/SP} has} diabetes or sugar diabetes?
# BPQ020 - Ever told you had high blood pressure
# https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/BPQ_H.htm
# {Have you/Has SP} ever been told by a doctor or other health professional that {you/s/he} had hypertension, also called high blood pressure?
# MCQ220 - Ever told you had cancer or malignancy
# https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/MCQ_H.htm#MCQ220
# {Have you/Has SP} ever been told by a doctor or other health professional that {you/s/he} had cancer or a malignancy (ma-lig-nan-see) of any kind?
# Create the target dataset for the Supervised problem.
temp_questionnaire <- read.csv("Data/Raw/questionnaire.csv", header = TRUE, na.strings = c("NA", "", "#NA"))
target_columns <- c("SEQN", "DIQ010", "BPQ020", "MCQ220")
target_disease_dataset <- subset(temp_questionnaire, select = target_columns)
# Change disease indicators into factors
target_disease_dataset$MCQ220 <- as.factor(target_disease_dataset$MCQ220)
target_disease_dataset$DIQ010 <- as.factor(target_disease_dataset$DIQ010)
target_disease_dataset$BPQ020 <- as.factor(target_disease_dataset$BPQ020)
# Create new columns for target values (questionnaire code 1 = "Yes")
target_disease_dataset <- cbind(target_disease_dataset, HAS_DIABETES = ifelse(target_disease_dataset$DIQ010 == 1, "YES", "NO"))
target_disease_dataset <- cbind(target_disease_dataset, HAS_HYPERTENSION = ifelse(target_disease_dataset$BPQ020 == 1, "YES", "NO"))
target_disease_dataset <- cbind(target_disease_dataset, HAS_CANCER = ifelse(target_disease_dataset$MCQ220 == 1, "YES", "NO"))
summary(target_disease_dataset)
# With new target values, set "NA" to "NO"
target_disease_dataset$HAS_DIABETES[is.na(target_disease_dataset$HAS_DIABETES)] <- "NO"
target_disease_dataset$HAS_HYPERTENSION[is.na(target_disease_dataset$HAS_HYPERTENSION)] <- "NO"
target_disease_dataset$HAS_CANCER[is.na(target_disease_dataset$HAS_CANCER)] <- "NO"
summary(target_disease_dataset)
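A quick sanity check on the recoded targets (a sketch using the columns created above):

```r
# Distribution of each new indicator, and the diabetes/hypertension overlap
table(target_disease_dataset$HAS_DIABETES, useNA = "ifany")
table(target_disease_dataset$HAS_DIABETES,
      target_disease_dataset$HAS_HYPERTENSION,
      dnn = c("HAS_DIABETES", "HAS_HYPERTENSION"))
```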
Given that an individual has diabetes, predict whether the individual has cancer or hypertension. Use as little data as possible to keep costs low.
Choose attributes to predict the target.
library(readr)
combined_target_final <- read_csv("Data/Target Datasets/combined_target_final.csv")
library(devtools)
library(ggbiplot)
combined_target_final.pca <- prcomp(combined_target_final[, c(4:38)], center = TRUE, scale. = TRUE)
summary(combined_target_final.pca)
screeplot(combined_target_final.pca, type = "l", npcs = 20, main = "Screeplot of the first 20 PCs")
abline(h = 1, col="red", lty=5)
legend("topright", legend=c("Eigenvalue = 1"),
col=c("red"), lty=5, cex=0.6)
cumpro <- cumsum(combined_target_final.pca$sdev^2 / sum(combined_target_final.pca$sdev^2))
plot(cumpro[1:20], xlab = "PC #", ylab = "Amount of explained variance", main = "Cumulative variance plot")
abline(v = 9, col="blue", lty=5)
abline(h = 0.79850, col="blue", lty=5)
legend("topleft", legend=c("Cut-off @ PC9"),
col=c("blue"), lty=5, cex=0.6)
We notice that the first 9 components each have an eigenvalue > 1 and together explain almost 80% of the variance. So if we reduce dimensionality from 35 to 9, we lose only about 20% of the variance.
The first two components explain only about 30% of the variance. We need 18 principal components to explain more than 95% of the variance, and 27 to explain more than 99%.
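The cut-offs above can also be computed programmatically. A minimal sketch (the helper name `pcs_needed` is ours):

```r
# Smallest number of principal components whose cumulative share of
# the total variance reaches a given threshold
pcs_needed <- function(pca, threshold) {
  cumvar <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
  which(cumvar >= threshold)[1]
}

# e.g. pcs_needed(combined_target_final.pca, 0.95) — 18 per the text above
```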
library(caret)
fitControl <- trainControl(method="cv",
number = 5,
preProcOptions = list(thresh = 0.99), # threshold for pca preprocess
classProbs = TRUE,
summaryFunction = twoClassSummary)
We are going to create training and test sets from these data:
combined_target_final <- read_csv("Data/Target Datasets/combined_target_final.csv")
require(caret)
require(dplyr)
require(caretEnsemble)
require(pROC)
set.seed(101)
data_index <- createDataPartition(combined_target_final$TARGET, p=0.75, list = FALSE)
train_Combined <- combined_target_final[data_index,-c(1,2,39,40,41,42) ]
test_Combined <- combined_target_final[-data_index, -c(1,2,39,40,41,42)]
#try to predict class probabilities in R - caret
levels(train_Combined$TARGET) <- make.names(levels(factor(train_Combined$TARGET)))
levels(test_Combined$TARGET) <- make.names(levels(factor(test_Combined$TARGET)))
#train_Combined$TARGET = as.factor(train_Combined$TARGET)
#test_Combined$TARGET = as.factor(test_Combined$TARGET)
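Before training, it is worth checking that the stratified split preserved the class balance (a quick sketch using the partitions above):

```r
# Class proportions in the training and test partitions; createDataPartition
# stratifies on TARGET, so these should be nearly identical
prop.table(table(train_Combined$TARGET))
prop.table(table(test_Combined$TARGET))
```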
Let’s try logistic regression (elastic-net regularized, via glmnet):
model_lr <- train(TARGET~.,train_Combined,
method = "glmnet",
metric="ROC",
#tuneGrid = expand.grid(alpha = c(0, .1, .2, .4, .6, .8, 1),lambda = seq(.01, .2, length = 20)),
preProcess = c("center", "scale"),
trControl=fitControl)
pred_lr <- predict(model_lr, test_Combined)
cm_lr <- confusionMatrix(pred_lr, test_Combined$TARGET, positive = "X1")
cm_lr
Confusion Matrix and Statistics
Reference
Prediction X0 X1
X0 13800 0
X1 0 8006
Accuracy : 1
95% CI : (0.9998, 1)
No Information Rate : 0.6329
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Sensitivity : 1.0000
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 1.0000
Prevalence : 0.3671
Detection Rate : 0.3671
Detection Prevalence : 0.3671
Balanced Accuracy : 1.0000
'Positive' Class : X1
Let’s try random forest:
model_rf <- train(as.factor(TARGET)~.,
train_Combined,
method="ranger",
metric="ROC",
#tuneLength=10,
#tuneGrid = expand.grid(mtry = c(2, 3, 6)),
#tuneGrid = data.frame(mtry = 3)
#tuneGrid = tunegrid, # `tunegrid` was never defined; caret's default tuning is used
preProcess = c('center', 'scale'),
trControl=fitControl)
pred_rf <- predict(model_rf, test_Combined)
cm_rf <- confusionMatrix(pred_rf, test_Combined$TARGET, positive = "X1")
cm_rf
Confusion Matrix and Statistics
Reference
Prediction X0 X1
X0 13800 2
X1 0 8004
Accuracy : 0.9999
95% CI : (0.9997, 1)
No Information Rate : 0.6329
P-Value [Acc > NIR] : <2e-16
Kappa : 0.9998
Mcnemar's Test P-Value : 0.4795
Sensitivity : 0.9998
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 0.9999
Prevalence : 0.3671
Detection Rate : 0.3671
Detection Prevalence : 0.3671
Balanced Accuracy : 0.9999
'Positive' Class : X1
Let’s try random forest with PCA preprocessing:
model_pca_rf <- train(TARGET~.,
train_Combined,
method="ranger",
metric="ROC",
#tuneLength=10,
#tuneGrid = expand.grid(mtry = c(2, 3, 6)),
preProcess = c('center', 'scale', 'pca'),
trControl=fitControl)
pred_pca_rf <- predict(model_pca_rf, test_Combined)
cm_pca_rf <- confusionMatrix(pred_pca_rf, test_Combined$TARGET, positive = "X1")
cm_pca_rf
Confusion Matrix and Statistics
Reference
Prediction X0 X1
X0 13799 1
X1 1 8005
Accuracy : 0.9999
95% CI : (0.9997, 1)
No Information Rate : 0.6329
P-Value [Acc > NIR] : <2e-16
Kappa : 0.9998
Mcnemar's Test P-Value : 1
Sensitivity : 0.9999
Specificity : 0.9999
Pos Pred Value : 0.9999
Neg Pred Value : 0.9999
Prevalence : 0.3671
Detection Rate : 0.3671
Detection Prevalence : 0.3671
Balanced Accuracy : 0.9999
'Positive' Class : X1
Let’s try a KNN model:
model_knn <- train(TARGET~.,
train_Combined,
method="knn",
metric="ROC",
preProcess = c('center', 'scale'),
tuneLength=10,
trControl=fitControl)
pred_knn <- predict(model_knn, test_Combined)
cm_knn <- confusionMatrix(pred_knn, test_Combined$TARGET, positive = "X1")
cm_knn
Confusion Matrix and Statistics
Reference
Prediction X0 X1
X0 13754 77
X1 46 7929
Accuracy : 0.9944
95% CI : (0.9933, 0.9953)
No Information Rate : 0.6329
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.9879
Mcnemar's Test P-Value : 0.00683
Sensitivity : 0.9904
Specificity : 0.9967
Pos Pred Value : 0.9942
Neg Pred Value : 0.9944
Prevalence : 0.3671
Detection Rate : 0.3636
Detection Prevalence : 0.3657
Balanced Accuracy : 0.9935
'Positive' Class : X1
pred_prob_knn <- predict(model_knn, test_Combined, type="prob")
roc_knn <- roc(test_Combined$TARGET, pred_prob_knn$X1)
plot(roc_knn)
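The default class predictions use a 0.5 probability cut-off; with the probabilities above we can also sketch predictions at a custom threshold (0.35 here is an arbitrary illustrative value):

```r
# Re-classify KNN test predictions with a custom probability threshold
pred_knn_035 <- factor(ifelse(pred_prob_knn$X1 > 0.35, "X1", "X0"),
                       levels = c("X0", "X1"))
confusionMatrix(pred_knn_035, test_Combined$TARGET, positive = "X1")
```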
Let’s try an SVM model:
model_svm <- train(TARGET~.,
train_Combined,
method="svmRadial",
metric="ROC",
preProcess=c('center', 'scale'),
trace=FALSE,
trControl=fitControl)
pred_svm <- predict(model_svm, test_Combined)
cm_svm <- confusionMatrix(pred_svm, test_Combined$TARGET, positive = "X1")
cm_svm
Confusion Matrix and Statistics
Reference
Prediction X0 X1
X0 13800 1
X1 0 8005
Accuracy : 1
95% CI : (0.9997, 1)
No Information Rate : 0.6329
P-Value [Acc > NIR] : <2e-16
Kappa : 0.9999
Mcnemar's Test P-Value : 1
Sensitivity : 0.9999
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 0.9999
Prevalence : 0.3671
Detection Rate : 0.3671
Detection Prevalence : 0.3671
Balanced Accuracy : 0.9999
'Positive' Class : X1
Let’s compare the models and check their correlation:
library(corrplot)
model_list <- list(LR = model_lr, RF = model_rf, PCA_RF = model_pca_rf, KNN = model_knn, SVM = model_svm)
resamples <- resamples(model_list)
model_cor <- modelCor(resamples)
corrplot(model_cor)
model_cor
LR RF PCA_RF KNN SVM
LR 1 NA NA NA NA
RF NA 1 NA NA NA
PCA_RF NA NA 1.0000000 0.4587672 -0.4482732
KNN NA NA 0.4587672 1.0000000 -0.5315354
SVM NA NA -0.4482732 -0.5315354 1.0000000
bwplot(resamples, metric="ROC")
We see here that KNN has considerable variability depending on the resampled fold. Almost all the models achieve a high AUC, with some variability.
Let’s review how these models perform on the test dataset. Predicted classes are obtained by default with a threshold of 0.5, which may not be the best choice for an imbalanced dataset like this.
cm_list <- list(LR= cm_lr, RF=cm_rf, PCA_RF=cm_pca_rf, KNN = cm_knn, SVM=cm_svm)
cm_list_results <- sapply(cm_list, function(x) x$byClass)
cm_list_results
LR RF PCA_RF KNN SVM
Sensitivity 1.0000000 0.9997502 0.9998751 0.9903822 0.9998751
Specificity 1.0000000 1.0000000 0.9999275 0.9966667 1.0000000
Pos Pred Value 1.0000000 1.0000000 0.9998751 0.9942320 1.0000000
Neg Pred Value 1.0000000 0.9998551 0.9999275 0.9944328 0.9999275
Precision 1.0000000 1.0000000 0.9998751 0.9942320 1.0000000
Recall 1.0000000 0.9997502 0.9998751 0.9903822 0.9998751
F1 1.0000000 0.9998751 0.9998751 0.9923034 0.9999375
Prevalence 0.3671467 0.3671467 0.3671467 0.3671467 0.3671467
Detection Rate 0.3671467 0.3670549 0.3671008 0.3636155 0.3671008
Detection Prevalence 0.3671467 0.3670549 0.3671467 0.3657250 0.3671008
Balanced Accuracy 1.0000000 0.9998751 0.9999013 0.9935244 0.9999375
The best result for sensitivity (detection of diabetes) is logistic regression, which also has a high F1 score.
Random forest with PCA is about as sensitive as random forest.
require(nnet)
cm_results_max <- apply(cm_list_results, 1, which.is.max)
output_report <- data.frame(metric=names(cm_results_max),
best_model=colnames(cm_list_results)[cm_results_max],
value=mapply(function(x,y) {cm_list_results[x,y]},
names(cm_results_max),
cm_results_max))
rownames(output_report) <- NULL
output_report
metric best_model value
1 Sensitivity LR 1.0000000
2 Specificity RF 1.0000000
3 Pos Pred Value RF 1.0000000
4 Neg Pred Value LR 1.0000000
5 Precision LR 1.0000000
6 Recall LR 1.0000000
7 F1 LR 1.0000000
8 Prevalence SVM 0.3671467
9 Detection Rate LR 0.3671467
10 Detection Prevalence LR 0.3671467
11 Balanced Accuracy LR 1.0000000
We found that the logistic regression model, on preprocessed data, gives good results on the test set. This model has a sensitivity of 1.00 and an F1 score of 1.00.
The other models also have a sensitivity greater than 0.990.
With our models, we aim to present the following applications to the business to assist with drug deployment:
Firstly, for different diseases, our models can help identify attributes/values (clusters) related to a given disease.
This will help select suitable patients for trials. The trial process can be costly, and better patient selection reduces that cost. Additionally, this information can give researchers hints about which drug treatments might be helpful.
(PROVIDE EXAMPLE WHEN AVAILABLE)
Secondly, for different diseases, our models can identify related non-medical attributes (diet/demographics).
- This can help with directed marketing of our current and future drugs for those diseases.
- This could also assist with marketing to attract potential candidates for trials.
(PROVIDE EXAMPLE WHEN AVAILABLE)